%matplotlib inline
import numpy as np
import time
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import scipy.stats as stats
import pandas as pd
import duckdb
from tabulate import tabulate
sns.set(style="whitegrid")
THEME = "darkslategray"
print(duckdb.__version__)
0.5.1
import warnings
warnings.filterwarnings('ignore')
In this project, we have to deal with different types of airbnb data from Q3 2021 to Q3 2022 fro Washington D.C. The data includes listings, reviews, calendar and neighbourhood. As a secondary dataset, we chose to use the crime data for the same time range in Washington D.C which we have introduced it in the README file.
In order to create the database for further steps of the project, we created this datawarehouse.py script to load the data into the ps6.duckdb. The datawarehouse was fully encapsulated as DatawareHouse class and initialize the connection to the database in the constructor. We have many different methods defined in the class, such as createAllListings, createNeighborhood, createReviews, createCalendar and createCrime. The main functionality for those methods are loading the CSV file for different types of data and create the table and insert the data into those tables. There are also some data cleaning in the methods. For example, we convert the text type price into a float type and also append the neighbourhood for each crime so that we can link crimes to neighbourhoods, and neighbourhoods to listings.
Other than that, we also created a couple views in the database to make the further analysis easier. We have latest_listings (one listing per property), last_scraped, neighborhood_crimes and full_latest_listings (one listing per property, joined to crime statistics for the neighbourhood). Those views are helpful when we did the work in the ASK step and EDA step.
The primary and secondary source data files must be stored in the AirBnB_WashDC subdirectory. The primary source compressed files must be extracted in the same subdirectory, resulting in detailed listings having an appended " 2" in the filename due to having the same name as the summary listing sources (e.g. 11Jun22 listings 2.csv).
After running the script, all the data needed for ASK, EDA and modeling steps will be loaded into the ps6.duckdb. You will see 6 tables and 4 views were created.
%run -i 'datawarehouse.py'
all_listing tables have been created neighborhoods table has been created calendar12_2021.csv calendar03_2022.csv calendar06_2022.csv calendar09_2022.csv calendar12_2021.csv cleaned calendar03_2022.csv cleaned calendar06_2022.csv cleaned calendar09_2022.csv cleaned calendar tables have been created crime tables have been created
Our team wanted to provide a pricing tool to individuals in Washington D.C. who are considering listing their homes on AirBnB. For the average host, pricing is a difficult process and understanding all the factors that impact pricing strategy can be difficult. Using the AirBnB listing data for the last four quarters, we wanted to understand which attributes had the most impact on pricing for a typical listing. A typical listing is defined as one with a price between $0-1000 per night, as that comprises 98% of all listings in Washington D.C.
An important consideration for pricing a short-term rental is location, we know that our hosts want to be competitive with other listings in their neighborhoods. Our team decided that to give our predictive model an edge in accuracy, we wanted to include crime as a secondary data source as we believed that could yield benefits to the pricing power of listings. We believe that with sufficient crime location data and using other information from existing listings, we will see that crime and location are two factors with a positive impact on the price of AirBnBs.
The target variable for our model is price and the secondary data set is crime data from the Washington, DC Metropolitan Police Department.
We endeavored to answer the following questions:
placeholder for update
With the data set in a convenient format, we will set about exploring the data. With price as the target variable, we will look at the following factors:
Our initial thoughts on relationships are shown here:

Given our experiences, we discussed the following relationships:
There is a loop in the CLD for review scores and price. As review scores go up, we assumed that the owner could charge more for a stay.
con = duckdb.connect(database='ps6.duckdb', read_only=True)
# Loading dataset
# connect to database
# read the result of an arbitrary SQL query to a Pandas DataFrame
all_listings = con.execute("SELECT * from all_listings").df()
all_listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 28076 entries, 0 to 28075 Data columns (total 74 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 28076 non-null int64 1 listing_url 28076 non-null object 2 scrape_id 28076 non-null int64 3 last_scraped 28076 non-null datetime64[ns] 4 name 28074 non-null object 5 description 27666 non-null object 6 neighborhood_overview 18547 non-null object 7 picture_url 28076 non-null object 8 host_id 28076 non-null int32 9 host_url 28076 non-null object 10 host_name 27905 non-null object 11 host_since 27905 non-null datetime64[ns] 12 host_location 27027 non-null object 13 host_about 17347 non-null object 14 host_response_time 27905 non-null object 15 host_response_rate 20726 non-null float64 16 host_acceptance_rate 22014 non-null float64 17 host_is_superhost 28061 non-null object 18 host_thumbnail_url 27905 non-null object 19 host_picture_url 27905 non-null object 20 host_neighbourhood 25862 non-null object 21 host_listings_count 27905 non-null float64 22 host_total_listings_count 27905 non-null float64 23 host_verifications 28076 non-null object 24 host_has_profile_pic 27905 non-null object 25 host_identity_verified 27905 non-null object 26 neighbourhood 18547 non-null object 27 neighbourhood_cleansed 28076 non-null object 28 neighbourhood_group_cleansed 0 non-null float64 29 latitude 28076 non-null float64 30 longitude 28076 non-null float64 31 property_type 28076 non-null object 32 room_type 28076 non-null object 33 accommodates 28076 non-null int32 34 bathrooms 0 non-null float64 35 bathrooms_text 28036 non-null object 36 bedrooms 25340 non-null float64 37 beds 27307 non-null float64 38 amenities 28076 non-null object 39 price 28076 non-null float64 40 minimum_nights 28076 non-null int32 41 maximum_nights 28076 non-null int32 42 minimum_minimum_nights 28067 non-null float64 43 maximum_minimum_nights 28067 non-null float64 44 minimum_maximum_nights 28067 non-null float64 45 maximum_maximum_nights 28067 non-null float64 46 minimum_nights_avg_ntm 28067 non-null float64 47 maximum_nights_avg_ntm 28067 non-null float64 48 calendar_updated 0 non-null float64 49 has_availability 28076 non-null bool 50 availability_30 28076 non-null int32 51 availability_60 28076 non-null int32 52 availability_90 28076 non-null int32 53 availability_365 28076 non-null int32 54 calendar_last_scraped 28076 non-null datetime64[ns] 55 number_of_reviews 28076 non-null int32 56 number_of_reviews_ltm 28076 non-null int32 57 number_of_reviews_l30d 28076 non-null int32 58 first_review 21844 non-null datetime64[ns] 59 last_review 21844 non-null datetime64[ns] 60 review_scores_rating 21844 non-null float64 61 review_scores_accuracy 21649 non-null float64 62 review_scores_cleanliness 21651 non-null float64 63 review_scores_checkin 21645 non-null float64 64 review_scores_communication 21651 non-null float64 65 review_scores_location 21646 non-null float64 66 review_scores_value 21641 non-null float64 67 license 6918 non-null object 68 instant_bookable 28076 non-null bool 69 calculated_host_listings_count 28076 non-null int32 70 calculated_host_listings_count_entire_homes 28076 non-null int32 71 calculated_host_listings_count_private_rooms 28076 non-null int32 72 calculated_host_listings_count_shared_rooms 28076 non-null int32 73 reviews_per_month 21844 non-null float64 dtypes: bool(2), datetime64[ns](5), float64(26), int32(15), int64(2), object(24) memory usage: 13.9+ MB
The all_listings table includes every row from the detailed listing data from the Inside Airbnb data source for the Washington, D.C. area from December 2021 to September 2022. This includes information about the host, amenities, and review scores.
The source data included 4 quarters of listing data. If a listing spanned more than one quarter, it would show up more than once in this all_listings table.
In order to avoid counting a listing more than once, we can use the latest_listings table:
latest_listings = con.execute("select * from latest_listings;").df()
latest_listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10560 entries, 0 to 10559 Data columns (total 74 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10560 non-null int64 1 listing_url 10560 non-null object 2 scrape_id 10560 non-null int64 3 last_scraped 10560 non-null datetime64[ns] 4 name 10559 non-null object 5 description 10416 non-null object 6 neighborhood_overview 6559 non-null object 7 picture_url 10560 non-null object 8 host_id 10560 non-null int32 9 host_url 10560 non-null object 10 host_name 10398 non-null object 11 host_since 10398 non-null datetime64[ns] 12 host_location 9548 non-null object 13 host_about 6013 non-null object 14 host_response_time 10398 non-null object 15 host_response_rate 7452 non-null float64 16 host_acceptance_rate 8120 non-null float64 17 host_is_superhost 10554 non-null object 18 host_thumbnail_url 10398 non-null object 19 host_picture_url 10398 non-null object 20 host_neighbourhood 9583 non-null object 21 host_listings_count 10398 non-null float64 22 host_total_listings_count 10398 non-null float64 23 host_verifications 10560 non-null object 24 host_has_profile_pic 10398 non-null object 25 host_identity_verified 10398 non-null object 26 neighbourhood 6559 non-null object 27 neighbourhood_cleansed 10560 non-null object 28 neighbourhood_group_cleansed 0 non-null float64 29 latitude 10560 non-null float64 30 longitude 10560 non-null float64 31 property_type 10560 non-null object 32 room_type 10560 non-null object 33 accommodates 10560 non-null int32 34 bathrooms 0 non-null float64 35 bathrooms_text 10546 non-null object 36 bedrooms 9600 non-null float64 37 beds 10291 non-null float64 38 amenities 10560 non-null object 39 price 10560 non-null float64 40 minimum_nights 10560 non-null int32 41 maximum_nights 10560 non-null int32 42 minimum_minimum_nights 10557 non-null float64 43 maximum_minimum_nights 10557 non-null float64 44 minimum_maximum_nights 10557 non-null float64 45 maximum_maximum_nights 10557 non-null float64 46 minimum_nights_avg_ntm 10557 non-null float64 47 maximum_nights_avg_ntm 10557 non-null float64 48 calendar_updated 0 non-null float64 49 has_availability 10560 non-null bool 50 availability_30 10560 non-null int32 51 availability_60 10560 non-null int32 52 availability_90 10560 non-null int32 53 availability_365 10560 non-null int32 54 calendar_last_scraped 10560 non-null datetime64[ns] 55 number_of_reviews 10560 non-null int32 56 number_of_reviews_ltm 10560 non-null int32 57 number_of_reviews_l30d 10560 non-null int32 58 first_review 7740 non-null datetime64[ns] 59 last_review 7740 non-null datetime64[ns] 60 review_scores_rating 7740 non-null float64 61 review_scores_accuracy 7656 non-null float64 62 review_scores_cleanliness 7657 non-null float64 63 review_scores_checkin 7654 non-null float64 64 review_scores_communication 7657 non-null float64 65 review_scores_location 7655 non-null float64 66 review_scores_value 7652 non-null float64 67 license 3427 non-null object 68 instant_bookable 10560 non-null bool 69 calculated_host_listings_count 10560 non-null int32 70 calculated_host_listings_count_entire_homes 10560 non-null int32 71 calculated_host_listings_count_private_rooms 10560 non-null int32 72 calculated_host_listings_count_shared_rooms 10560 non-null int32 73 reviews_per_month 7740 non-null float64 dtypes: bool(2), datetime64[ns](5), float64(26), int32(15), int64(2), object(24) memory usage: 5.2+ MB
Let's start with an investigation of AirBnB price. Price is an integer variable that corresponds to the "daily price in local currency" of the listing. According to [https://www.alltherooms.com/analytics/average-airbnb-prices-by-city/], the average price per night for AirBnBs in DC is $147 in 2021. We expect the numbers for 2022 to be slightly higher. I also expect the distribution to be mostly symmetric and unimodal about the mean. I think this because of the number of datapoints we have being so high that the CLT states that the distribution will approach the Normal distribution.
When we load the listings table into the listings dataframe without altering the data at all, we must be mindful of the repeat data. For single variable EDA, it is probably ok for us to leave the duplicates in because we don't expect listings' price to change to much from quarter to quarter. When we start looking at pairwise EDA, we might need to be careful about duplicate listings when we start looking at things like counts per neighborhood and average price per neighborhood.
all_listings["price"].describe()
count 28076.000000 mean 188.662594 std 384.123697 min 0.000000 25% 85.000000 50% 125.000000 75% 200.000000 max 24999.000000 Name: price, dtype: float64
We have just over 28,000 data points for price. The mean price per night was \$188 which is slightly above the average price from 2021 (as is expected). The median price per night is much lower at 125 which means we likely have a skew to the right with high priced AirBnBs skewing the mean. The minimum price being 0 dollars is suspicious as is the maximum price of 24,000 dollars per night.
con.execute("SELECT DISTINCT id, name, price from all_listings WHERE price < 5")
low_cost = list(con.fetchall())
print(tabulate(low_cost, headers=["id", "name", "price"], tablefmt='fancy_grid'))
╒══════════╤════════════════════════════════╤═════════╕ │ id │ name │ price │ ╞══════════╪════════════════════════════════╪═════════╡ │ 42738808 │ Capital View Hostel │ 0 │ ├──────────┼────────────────────────────────┼─────────┤ │ 43036130 │ U Street Capsule Hostel │ 0 │ ├──────────┼────────────────────────────────┼─────────┤ │ 46253554 │ citizenM Washington DC Capitol │ 0 │ ├──────────┼────────────────────────────────┼─────────┤ │ 43301430 │ Riggs Washington DC │ 0 │ ├──────────┼────────────────────────────────┼─────────┤ │ 42065771 │ The LINE Hotel DC │ 0 │ ├──────────┼────────────────────────────────┼─────────┤ │ 43308773 │ Viceroy Washington DC │ 0 │ ╘══════════╧════════════════════════════════╧═════════╛
con.execute("SELECT DISTINCT id, name, price from all_listings WHERE price > 5000")
low_cost = list(con.fetchall())
print(tabulate(low_cost, headers=["id", "name", "price"], tablefmt='fancy_grid'))
╒════════════════════╤════════════════════════════════════════════════════╤═════════╕ │ id │ name │ price │ ╞════════════════════╪════════════════════════════════════════════════════╪═════════╡ │ 14507861 │ Entire Capitol Hill Home - 5BR/4BA │ 5995 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 46004444 │ Yours Truly DC, 2 Bedroom Master Suite │ 10000 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 614471937104927680 │ NEW Listing! Unique House+Garden Rental, sleeps 40 │ 7500 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 8303678 │ Vista 2 Bedroom Rowhome FoggyBottom │ 6000 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 8784458 │ Spacious condo in NW, DC │ 10000 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 15054700 │ Historic Georgetown Residence │ 10000 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 47965462 │ Posh 2 Bed/2 Bath near Boiling AFB/DC │ 8000 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 48029085 │ The Dupont Diamond │ 24999 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 49613123 │ The Rosa Parks Safehouse │ 6500 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 50152089 │ Glover Park Hotel Georgetown-King with Kitchenette │ 9999 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 50154046 │ Glover Park Hotel- 2 Queen studio with Kitchenette │ 9228 │ ├────────────────────┼────────────────────────────────────────────────────┼─────────┤ │ 53042165 │ Amazing 2BR Condo @Mount Vernon Triangle w/Rooftop │ 8000 │ ╘════════════════════╧════════════════════════════════════════════════════╧═════════╛
There are 6 properties with a nightly price of 0 USD and the 1 AirBnB with a price above 10,000 USD is the 25,000 USD listing called "The Dupont Diamond" - a diamond indeed! There are also a handful of properties above 5000 and/or equal to 10,000. It might be worth removing these listings from the dataset.
## Code taken from Lab 5 solution
def freeman_diaconis( data):
quartiles = stats.mstats.mquantiles( data, [0.25, 0.5, 0.75])
iqr = quartiles[2] - quartiles[ 0]
n = len( data)
h = 2.0 * (iqr/n**(1.0/3.0))
return int( h)
h = freeman_diaconis(all_listings.price)
plot_data = all_listings.price
variable_name = "price"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, h)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data,bins=bins, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 7
Yes - let's get rid of the 25000 listing and see if the histogram starts to make more sense
trimmed_listings = all_listings[all_listings.price < 1500]
plot_data = trimmed_listings.price
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Removing listings below 1500 reveals a distribution with a very strong skew to the right. The distribution appears to be almost exponential which would not make much sense given our domain knowledge. Let's use Freeman Diaconis bins and limit the listings from 0-500
listings_below_500 = all_listings[all_listings.price < 500]
h = freeman_diaconis(listings_below_500.price)
plot_data = listings_below_500.price
variable_name = "price"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, h)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data,bins=bins, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 7
Looking at the above distribution, we see a that there is a strong skew to the right even when we remove all listings over 500 per night. This can be interpreted as the majority of AirBnBs having a nightly price that is between 85 and 200 per night but then increasingly expensive houses becomes less and less popular.
Moving on to the single variable analysis of neighborhoods...
The neighborhood variable is the string name of the DMV neighborhood in which the AirBnB is located. The dataset description says that the AirBnBs in this dataset are all located within D.C. so we expect the neighborhood names to be all neighborhoods within the actual district. As for how many neighborhoods there will be, that depends on how granular the neighborhood boundaries are. It would make sense for AirBnB to default to some governmental agency to define the neighborhood boundaries because neighorhood boundaries are highly subjective and constantly changing in the colloquial sense. For that reason, I expect the number of neighborhoods to be somewhere between 20 and 40.
# read the result of an arbitrary SQL query to a Pandas DataFrame
neighborhoods = con.execute("SELECT * from neighborhoods").df()
neighborhoods.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 39 entries, 0 to 38 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 neighbourhood_group 0 non-null object 1 neighbourhood 39 non-null object 2 type_outer 39 non-null object 3 feature_type 39 non-null object 4 geometry_type 39 non-null object 5 coordinates 39 non-null object dtypes: object(6) memory usage: 2.0+ KB
print(neighborhoods.neighbourhood)
0 Brightwood Park, Crestwood, Petworth 1 Brookland, Brentwood, Langdon 2 Capitol Hill, Lincoln Park 3 Capitol View, Marshall Heights, Benning Heights 4 Cathedral Heights, McLean Gardens, Glover Park 5 Cleveland Park, Woodley Park, Massachusetts Av... 6 Colonial Village, Shepherd Park, North Portal ... 7 Columbia Heights, Mt. Pleasant, Pleasant Plain... 8 Congress Heights, Bellevue, Washington Highlands 9 Deanwood, Burrville, Grant Park, Lincoln Heigh... 10 Douglas, Shipley Terrace 11 Downtown, Chinatown, Penn Quarters, Mount Vern... 12 Dupont Circle, Connecticut Avenue/K Street 13 Eastland Gardens, Kenilworth 14 Edgewood, Bloomingdale, Truxton Circle, Eckington 15 Fairfax Village, Naylor Gardens, Hillcrest, Su... 16 Friendship Heights, American University Park, ... 17 Georgetown, Burleith/Hillandale 18 Hawthorne, Barnaby Woods, Chevy Chase 19 Historic Anacostia 20 Howard University, Le Droit Park, Cardozo/Shaw 21 Ivy City, Arboretum, Trinidad, Carver Langston 22 Kalorama Heights, Adams Morgan, Lanier Heights 23 Lamont Riggs, Queens Chapel, Fort Totten, Plea... 24 Mayfair, Hillbrook, Mahaning Heights 25 Near Southeast, Navy Yard 26 North Cleveland Park, Forest Hills, Van Ness 27 North Michigan Park, Michigan Park, University... 28 River Terrace, Benning, Greenway, Dupont Park 29 Shaw, Logan Circle 30 Sheridan, Barry Farm, Buena Vista 31 Southwest Employment Area, Southwest/Waterfron... 32 Spring Valley, Palisades, Wesley Heights, Foxh... 33 Takoma, Brightwood, Manor Park 34 Twining, Fairlawn, Randle Highlands, Penn Bran... 35 Union Station, Stanton Park, Kingman Park 36 West End, Foggy Bottom, GWU 37 Woodland/Fort Stanton, Garfield Heights, Knox ... 38 Woodridge, Fort Lincoln, Gateway Name: neighbourhood, dtype: object
There are 38 total neighbrohoods in this dataset. They appear to encompass multiple colloquially-known neighborhoods given that their names almost all consist of multiple names combined in a comma-separated list. They appear in alphabetical order and the index on right right column of the above print-out will correspond to some labels I use below.
The neighborhoods table also includes a polygon shape of the neighborhood using lat/lon points. Let's map those:
!pip install geopandas
Requirement already satisfied: geopandas in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (0.12.1) Requirement already satisfied: pyproj>=2.6.1.post1 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from geopandas) (3.4.0) Requirement already satisfied: fiona>=1.8 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from geopandas) (1.8.22) Requirement already satisfied: shapely>=1.7 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from geopandas) (1.8.5.post1) Requirement already satisfied: packaging in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from geopandas) (21.3) Requirement already satisfied: pandas>=1.0.0 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from geopandas) (1.4.4) Requirement already satisfied: click>=4.0 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (8.1.3) Requirement already satisfied: click-plugins>=1.0 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (1.1.1) Requirement already satisfied: setuptools in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (65.3.0) Requirement already satisfied: attrs>=17 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (22.1.0) Requirement already satisfied: munch in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (2.5.0) Requirement already satisfied: certifi in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (2022.6.15.1) Requirement already satisfied: cligj>=0.5 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (0.7.2) Requirement already satisfied: six>=1.7 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from fiona>=1.8->geopandas) (1.16.0) Requirement already satisfied: python-dateutil>=2.8.1 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from pandas>=1.0.0->geopandas) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from pandas>=1.0.0->geopandas) (2022.2.1) Requirement already satisfied: numpy>=1.21.0 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from pandas>=1.0.0->geopandas) (1.23.3) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/davdev/opt/anaconda3/envs/en685648/lib/python3.10/site-packages (from packaging->geopandas) (3.0.9)
import geopandas as gpd
from shapely.geometry import Polygon
p_list = []
for p in neighborhoods.coordinates:
p_list.append(Polygon(p))
neighborhoods_gpd = gpd.GeoDataFrame(neighborhoods, crs='epsg:4326', geometry=p_list)
neighborhoods_gpd['rep_point'] = neighborhoods_gpd['geometry'].apply(lambda x: x.representative_point().coords[:])
neighborhoods_gpd['rep_point'] = [coords[0] for coords in neighborhoods_gpd['rep_point']]
neighborhoods_gpd['alpha_index'] = list(range(1,len(neighborhoods_gpd.neighbourhood)+1))
neighborhoods_gpd.info()
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 39 entries, 0 to 38 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 neighbourhood_group 0 non-null object 1 neighbourhood 39 non-null object 2 type_outer 39 non-null object 3 feature_type 39 non-null object 4 geometry_type 39 non-null object 5 coordinates 39 non-null object 6 geometry 39 non-null geometry 7 rep_point 39 non-null object 8 alpha_index 39 non-null int64 dtypes: geometry(1), int64(1), object(7) memory usage: 2.9+ KB
fig,ax = plt.subplots(figsize = (15,15))
neighborhoods_gpd.plot(ax=ax)
for idx, row in neighborhoods_gpd.iterrows():
plt.annotate(row['alpha_index'], xy=row['rep_point'],
horizontalalignment='center')
plt.show()
From the above map, we can see that all of these neighborhoods exist within the federal borders of the District of Columbia (up until now, we were unsure if these neighborhoods bled into the surrounding DMV area). The polygons are of diverse shapes and sizes.
The white spots in the map correspond to neighborhoods clusters that are not represented in the AirBnB dataset. These are:
Source: https://opendata.dc.gov/datasets/neighborhood-clusters/explore?location=38.855725%2C-77.000257%2C11.77.
We now move on to single variable EDA on variables that deal with data about the host of the AirBnB.
latest_hosts_unique = latest_listings.drop_duplicates(subset='host_id')
We want to investigate the relationship between neighborhoods and the booked price of airbnbs in their area. We will start with individual variable analysis of price and neighborhoods and then do a pairwise analyis. To start this, we will load the data from the listings table and the neighborhoods table into their own Pandas dataframes to make them easier to work with
We also want the relationship between price and data that relates to the host like
host_sincehost_locationhost_response_timehost_response_ratehost_acceptance_ratehost_is_superhosthost_listings_counthost_verificationshost_has_profile_pichost_identity_verifiedhost_since ¶This variable corresponds to the date on which the host became an AirBnB host. It is saved as a Pandas datetime. We can use this data to understand how many day/months/years each person has been a host. AirBnB is only about 14 years old so we expect the maximum number of years someone has been a host to be less than 14. Between 0 and 14 I expect most people to have been hosts between 0 and 7 years and fewer between 7 and 14 which would cause a skew to the right.
import datetime
timestamp = pd.Timestamp(datetime.datetime(2021, 10, 10))
res = timestamp.today()
latest_hosts_unique.days_host = (res - latest_hosts_unique.host_since) / np.timedelta64(1,'D')
latest_hosts_unique.years_host = (res - latest_hosts_unique.host_since) / np.timedelta64(1,'Y')
latest_hosts_unique.days_host
0 5128.610204
1 4384.610204
2 5112.610204
3 3880.610204
4 4258.610204
...
10531 272.610204
10547 1650.610204
10549 1038.610204
10550 1565.610204
10554 335.610204
Name: host_since, Length: 5843, dtype: float64
latest_hosts_unique.days_host.describe()
count 5840.000000 mean 2513.449759 std 986.443234 min 93.610204 25% 2019.610204 50% 2588.610204 75% 3161.860204 max 5133.610204 Name: host_since, dtype: float64
latest_hosts_unique.years_host.describe()
count 5840.000000 mean 6.881592 std 2.700790 min 0.256296 25% 5.529505 50% 7.087374 75% 8.656879 max 14.055347 Name: host_since, dtype: float64
The minimum number of years is 0.23 years and the maximum is just over 14 which tracks. The mean is just under 7 years.
h = freeman_diaconis(latest_hosts_unique.years_host.dropna())
plot_data = latest_hosts_unique.years_host.dropna()
variable_name = "Number of Years Being a Host"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data,bins=bins, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 0
This distribution is mostly symmetric with a slight skew to the left. The majority of the data is etween 6 and 9 years. Few hosts have been hosts for less than 4 years ore more than 10.5
host_location ¶This variable is a categorical string variable indicating where the host resides. I don't know if this data is self-reported or chosen from a selection. I would expect most hosts to live in D.C. because they own property within D.C.
latest_listings.host_location.value_counts(normalize=True)[:10]
Washington, DC 0.439045 Washington, District of Columbia, United States 0.227377 Houston, Texas, United States 0.062631 US 0.035714 New York, NY 0.031211 United States 0.012882 Arlington, VA 0.008065 Silver Spring, MD 0.007331 New York, New York, United States 0.006075 Alexandria, Virginia, United States 0.005237 Name: host_location, dtype: float64
data = latest_listings.host_location.value_counts(normalize=True)[:10]
x = range( len(data))
width = 1/1.5
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.bar(x, data, width, align="center", color="darkslategray")
axes.set_xticks(list(range(len(dict(latest_listings.host_location.value_counts()[:10]).values()))))
axes.set_xticklabels(dict(latest_listings.host_location.value_counts()[:10]).keys())
axes.set_title( "Distribution of Host Locations")
axes.set_xlabel("Location")
axes.set_xticklabels(dict(latest_listings.host_location.value_counts()[:10]).keys(), rotation=90, ha='right')
axes.set_ylabel( "Percent")
axes.xaxis.grid(False)
plt.show()
plt.close()
This data is not very useful. There are overlapping and redundant categories - namely there is bboth a "Washington, District of Columbia, United States" AND a "Washington, DC" category. There is also a "US" and "United States" category which might mean these people live in D.C. or might mean anywhere else within the United States.
Regardless, the majority of hosts live within DC.
host_response_time ¶Host response time is the AirBnB autogenerated cateogorical variable assigned to each host based on their response time to customer messages. These categories include things like "within an hour" or "within a couple of days". I expect most hosts to respond "within a day" or "within a few hours" because there is an advantage in getting customer to trust/like you if you respond faster.
latest_listings.host_response_time.value_counts()
within an hour 5525 N/A 2946 within a few hours 1078 within a day 657 a few days or more 192 Name: host_response_time, dtype: int64
data = latest_listings.host_response_time.value_counts()
x = range( len(data))
width = 1/1.5
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.bar(x, data, width, align="center", color="darkslategray")
axes.set_xticks(list(range(len(dict(latest_listings.host_response_time.value_counts()).values()))))
axes.set_title( "Distribution of Response Time")
axes.set_xlabel("Location")
axes.set_xticklabels(dict(latest_listings.host_response_time.value_counts()).keys(), rotation=90, ha='right')
axes.set_ylabel( "Number of Listings")
axes.xaxis.grid(False)
plt.show()
plt.close()
There is a sizable portion of hosts within the "N/A" category. That aside, the clear majority of remaining hosts respond within the hour with subsequent categories having a decreasing frequency in order of length of time to respond.
host_response_rate ¶Host response rate is the percentage of messages from customers that hosts respond to at all, regardless of time. This will be on a range of 0-100. I expect the mean number to be greater than 90% because hosts have an incentive to respond to every message from a potential customer in order to gain their business and current customers to maintain a high rating.
latest_listings['host_response_rate'] = latest_listings['host_response_rate'].replace('N/A',np.NaN)
latest_listings['host_response_rate'] = latest_listings['host_response_rate'].replace('%', '', regex=True).astype(float)
latest_listings['host_response_rate'].describe()
count 7452.000000 mean 95.040794 std 15.631311 min 0.000000 25% 100.000000 50% 100.000000 75% 100.000000 max 100.000000 Name: host_response_rate, dtype: float64
h = freeman_diaconis(latest_listings['host_response_rate'].dropna())
plot_data = latest_listings['host_response_rate'].dropna()
variable_name = "host_response_rate"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data,bins=bins, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 0
The distribution is clearly skewed to the right. Almost ll the data is etween 80 and 100%. There is another mode at 0% which is presumably from rows within listings that have null values and/or hosts that are completely inactive on the website but do still have accounts as hosts.
host_acceptance_rate ¶Host acceptance rate is the rate of customer bids that this host accepts from 0-100. I expect this number to be high (>80%) because hosts have a monetary incentive to accept as many guests as possible throughout the year. I don't expect the IQR to be 100% like the last variable because hosts cannot say yes to everyone - especially for high-demand properties.
latest_listings['host_acceptance_rate'] = latest_listings['host_acceptance_rate'].replace('N/A',np.NaN)
latest_listings['host_acceptance_rate'] = latest_listings['host_acceptance_rate'].replace('%', '', regex=True).astype(float)
latest_listings['host_acceptance_rate'].describe()
count 8120.000000 mean 85.751108 std 24.362805 min 0.000000 25% 85.000000 50% 97.000000 75% 100.000000 max 100.000000 Name: host_acceptance_rate, dtype: float64
The mean acceptance rate is 85%. The median is 97% indicating that, once again, the mean acceptance rate is being skewed by a handful of low outliers - likely hosts with a zero acceptance rate.
h = freeman_diaconis(latest_listings['host_acceptance_rate'].dropna())
plot_data = latest_listings['host_acceptance_rate'].dropna()
variable_name = "host_acceptance_rate"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data,bins=bins, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 1
this distribution is highly skewed to the left once again. There are a handful of hosts with a 0% acceptance rate. This is likely from rows in listings that have all null values or from inactive hosts that still have desired property listings but never accept new guests.
host_is_superhost ¶This variable is a boolean variable indicating whether or not this host is a superhost. According to AirBnB, superhost status requires:
Based on past EDA and domain knowledge, these requirements are difficult to achieve. I expect there to be more non-superhosts than superhosts.
latest_listings.host_is_superhost.value_counts()
False 7950 True 2604 Name: host_is_superhost, dtype: int64
Only a quarter of hosts are superhosts.
data = latest_listings.host_is_superhost.value_counts(normalize=True)
x = range( len(data))
width = 1/1.5
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.bar(x, data, width, align="center", color="darkslategray")
axes.set_xticks(list(range(len(dict(latest_listings.host_is_superhost.value_counts()).values()))))
axes.set_title( "Distribution of Superhost Y/N")
axes.set_xlabel("Is Superhost")
axes.set_xticklabels(dict(latest_listings.host_is_superhost.value_counts()).keys())
axes.set_ylabel( "Proportion")
axes.xaxis.grid(False)
plt.show()
plt.close()
75% of hosts within the last quarter are not superhosts, whereas 25% of them are.
host_listings_count ¶This variable counts the number of listings that this host has within AirBnB. I expect this number to range from 1 to 25 with most hosts only having between 1 and 3 properties.
latest_listings['host_listings_count'].describe()
count 10398.000000 mean 165.797750 std 676.694138 min 0.000000 25% 1.000000 50% 2.000000 75% 10.000000 max 4057.000000 Name: host_listings_count, dtype: float64
There are a couple of clear outliers and/or impossibilities. There should not be any hosts in the AirBnB database with 0 listings. The maximum number being >4000 seems to be an extreme outlier given that the IQR is between 1 and 10 listings. The mean is being heavily skewed by these high outliers.
# h = freeman_diaconis(latest_listings['host_listings_count'].dropna())
# plot_data = latest_listings['host_listings_count'].dropna()
# variable_name = "host_listings_count"
# print("Freeman Diaconis: ", h)
# mn = int(plot_data.min())
# mx = int(plot_data.max())
# bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Clearly, most of the data is less than 500, and likely most of the data is less than 20. Let's chop off some of the very high values.
latest_listings.host_listings_count_under_500 = latest_listings.host_listings_count[latest_listings.host_listings_count < 500]
h = freeman_diaconis(latest_listings.host_listings_count_under_500.dropna())
plot_data = latest_listings.host_listings_count_under_500.dropna()
variable_name = "host_listings_count"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 0
There are a handful of hosts within the 50 to 250 range but clearly most of the data is less than 50.
latest_listings.host_listings_count_under_100 = latest_listings.host_listings_count[latest_listings.host_listings_count < 100]
h = freeman_diaconis(latest_listings.host_listings_count_under_100.dropna())
plot_data = latest_listings.host_listings_count_under_100.dropna()
variable_name = "host_listings_count"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 0
More filtering...
latest_listings.host_listings_count_under_20 = latest_listings.host_listings_count[latest_listings.host_listings_count < 20]
h = freeman_diaconis(latest_listings.host_listings_count_under_20.dropna())
plot_data = latest_listings.host_listings_count_under_20.dropna()
variable_name = "host_listings_count"
print("Freeman Diaconis: ", h)
mn = int(plot_data.min())
mx = int(plot_data.max())
bins = [i for i in range( mn, mx, 1)]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(plot_data, color="darkslategray")
axes.set_title(variable_name + " distribution - Freeman Diaconis")
axes.set_xlabel(variable_name)
plt.show()
plt.close()
Freeman Diaconis: 0
When we look only at data between 0 and 20, we see that there is a clear exponential distribution of this variable. Most people have 1 listing. Some people have between 2 and 4. And then a handful of people have vetween 4 and 20 in a decreasing fashion.
host_verifications ¶This variable is a list of all the ways in which the host's identity has been verified. The popular avenues include things like "email" and "phone" via verification code but hosts also have the option of verifying via government IDs or any combination of things.
Because this variable is a list, I expect there to be hundres if not thousands of unique host_verification values. I expect "email" and "phone" to be the main avenues of verification because of its ease.
latest_listings.host_verifications.value_counts(normalize=True)[:10]
['email', 'phone'] 0.550095 ['email', 'phone', 'work_email'] 0.136080 ['phone'] 0.051705 ['email', 'phone', 'reviews', 'kba'] 0.024716 ['email', 'phone', 'reviews', 'jumio', 'government_id'] 0.016004 None 0.015341 ['email', 'phone', 'reviews'] 0.014583 ['email', 'phone', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual'] 0.010511 ['email', 'phone', 'reviews', 'jumio', 'offline_government_id', 'government_id'] 0.010133 ['email', 'phone', 'offline_government_id', 'selfie', 'government_id', 'identity_manual'] 0.009280 Name: host_verifications, dtype: float64
There were too many unique host_verification values to display so I displayed the 10 most common ones. ['email', 'phone'] were the most common at 55%. The second most common was just ['phone']. After that, no combination of host_verification (though all including 'email' and 'phone' except for None) make up more than ~5% of the hosts.
data = latest_listings.host_verifications.value_counts(normalize=True)[:10]
x = range( len(data))
width = 1/1.5
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.bar(x, data, width, align="center", color="darkslategray")
axes.set_xticks(list(range(len(dict(latest_listings.host_verifications.value_counts(normalize=True)[:10]).values()))))
axes.set_title( "Host Verifications")
axes.set_xlabel("Modes of Verifications")
axes.set_xticklabels(dict(latest_listings.host_verifications.value_counts(normalize=True)[:10]).keys(),rotation=90)
axes.set_ylabel( "Percent")
axes.xaxis.grid(False)
plt.show()
plt.close()
This chart shows the clear majority of host_verification being both email and phone.
host_has_profile_pic ¶This is a boolean variable indicating whether or not the host has a profile picture. According to AirBnB, all hosts are required to have a profile picture so I expect this variable to be almost 100% yes. Although there are null values in this dataset or junk rows that may present as "no"s.
latest_listings.host_has_profile_pic.value_counts()
True 10325 False 73 Name: host_has_profile_pic, dtype: int64
As expected, almost all hosts have a profile picture.
data = latest_listings.host_has_profile_pic.value_counts(normalize=True)
x = range( len(data))
width = 1/1.5
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.bar(x, data, width, align="center", color="darkslategray")
axes.set_xticks(list(range(len(dict(latest_listings.host_has_profile_pic.value_counts()).values()))))
axes.set_title( "Has Profile Picture Y/N")
axes.set_xlabel("Has Profile Picture")
axes.set_xticklabels(dict(latest_listings.host_has_profile_pic.value_counts()).keys())
axes.set_ylabel( "Proportion")
axes.xaxis.grid(False)
plt.show()
plt.close()
Almost 100% of hosts have a profile picture and only 73 do not.
host_identity_verified ¶This variable is another boolean indicating whether the host's identity has been verified. AirBnB's process for host identification involves hosts providing their name, date of birth, or government ID.
I would expect a clear majority of hosts to have their identity verified because they have an incentive to build trust with potential customers because that would ostensibly make customers more likely to book a listing with them.
latest_listings.host_identity_verified.value_counts()
True 8675 False 1723 Name: host_identity_verified, dtype: int64
data = latest_listings.host_identity_verified.value_counts(normalize=True)
x = range( len(data))
width = 1/1.5
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.bar(x, data, width, align="center", color="darkslategray")
axes.set_xticks(list(range(len(dict(latest_listings.host_identity_verified.value_counts()).values()))))
axes.set_title( "Host Identity Verified Y/N")
axes.set_xlabel("Identity Verified")
axes.set_xticklabels(dict(latest_listings.host_identity_verified.value_counts()).keys())
axes.set_ylabel( "Proportion")
axes.xaxis.grid(False)
plt.show()
plt.close()
Almost 85% of hosts have had their identity verified and just over 15% of hosts have not - as expected.
The following room descriptors give information about the rental property. How many rooms, bathrooms, and beds does it have? How many people can it accomodate? All things being equal, the more of these will mean a higher price to rent.
Target Variable
For the following exploration, we will use the latest_listings table, which includes only one of each listing, taking the most recent data if there were duplicates in the data set.
First, we'll bring in the amenities information and look at it from a high level
con.execute("select property_type, room_type, accommodates, bathrooms, bathrooms_text, bedrooms, beds, amenities, price from latest_listings;")
listings = con.fetchall()
room_columns = ['property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price']
listings_df = pd.DataFrame(listings, columns=room_columns)
listings_df.head()
| property_type | room_type | accommodates | bathrooms | bathrooms_text | bedrooms | beds | amenities | price | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Private room in home | Private room | 1 | None | 1 private bath | 1.0 | 1.0 | ["Dryer", "Smoke alarm", "Cooking basics", "In... | 67.0 |
| 1 | Private room in rental unit | Private room | 1 | None | 2 shared baths | 1.0 | 1.0 | ["Smoke alarm", "Cooking basics", "Bed linens"... | 39.0 |
| 2 | Private room in townhouse | Private room | 3 | None | 1 private bath | 1.0 | 2.0 | ["Dryer", "Smoke alarm", "Single level home", ... | 75.0 |
| 3 | Entire condo | Entire home/apt | 2 | None | 1.5 baths | 1.0 | 1.0 | ["Dryer", "TV with standard cable", "Cooking b... | 250.0 |
| 4 | Private room in guest suite | Private room | 3 | None | 1 private bath | 1.0 | 2.0 | ["Dryer", "Smoke alarm", "Paid parking on prem... | 113.0 |
trimmed_listings = listings_df[listings_df.price > 0]
First, we'll look at what this column describes.
Queries:
listings_df.property_type.describe()
count 10560 unique 61 top Entire rental unit freq 3488 Name: property_type, dtype: object
There are 61 distinct values for this column, with the most frequent item being "Entire rental unit."
Next, we'll look at the distribution of each category. Since there are 61 of these, we'll rely on a table to parse the info.
con.execute("select property_type, count(property_type) as prop_count from latest_listings group by property_type order by prop_count desc;")
distinct_property_types = con.fetchall()
print(tabulate(distinct_property_types, headers=['Property Type', 'Count'], tablefmt='fancy_grid'))
╒═════════════════════════════════════╤═════════╕ │ Property Type │ Count │ ╞═════════════════════════════════════╪═════════╡ │ Entire rental unit │ 3488 │ ├─────────────────────────────────────┼─────────┤ │ Entire serviced apartment │ 889 │ ├─────────────────────────────────────┼─────────┤ │ Entire townhouse │ 811 │ ├─────────────────────────────────────┼─────────┤ │ Entire home │ 739 │ ├─────────────────────────────────────┼─────────┤ │ Entire guest suite │ 632 │ ├─────────────────────────────────────┼─────────┤ │ Private room in home │ 606 │ ├─────────────────────────────────────┼─────────┤ │ Entire condo │ 595 │ ├─────────────────────────────────────┼─────────┤ │ Private room in townhouse │ 522 │ ├─────────────────────────────────────┼─────────┤ │ Private room in rental unit │ 449 │ ├─────────────────────────────────────┼─────────┤ │ Entire residential home │ 307 │ ├─────────────────────────────────────┼─────────┤ │ Entire condominium (condo) │ 300 │ ├─────────────────────────────────────┼─────────┤ │ Private room in residential home │ 294 │ ├─────────────────────────────────────┼─────────┤ │ Room in boutique hotel │ 146 │ ├─────────────────────────────────────┼─────────┤ │ Room in hotel │ 92 │ ├─────────────────────────────────────┼─────────┤ │ Entire guesthouse │ 83 │ ├─────────────────────────────────────┼─────────┤ │ Private room in condo │ 73 │ ├─────────────────────────────────────┼─────────┤ │ Private room in condominium (condo) │ 71 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in rental unit │ 70 │ ├─────────────────────────────────────┼─────────┤ │ Private room in guest suite │ 51 │ ├─────────────────────────────────────┼─────────┤ │ Entire loft │ 51 │ ├─────────────────────────────────────┼─────────┤ │ Private room in bed and breakfast │ 41 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in townhouse │ 29 │ ├─────────────────────────────────────┼─────────┤ │ Private room in loft │ 18 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in home │ 17 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in hostel │ 17 │ ├─────────────────────────────────────┼─────────┤ │ Entire vacation home │ 17 │ ├─────────────────────────────────────┼─────────┤ │ Room in aparthotel │ 16 │ ├─────────────────────────────────────┼─────────┤ │ Entire place │ 13 │ ├─────────────────────────────────────┼─────────┤ │ Room in hostel │ 12 │ ├─────────────────────────────────────┼─────────┤ │ Entire bungalow │ 11 │ ├─────────────────────────────────────┼─────────┤ │ Private room in guesthouse │ 10 │ ├─────────────────────────────────────┼─────────┤ │ Room in bed and breakfast │ 7 │ ├─────────────────────────────────────┼─────────┤ │ Private room in hostel │ 7 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in residential home │ 7 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in guesthouse │ 6 │ ├─────────────────────────────────────┼─────────┤ │ Private room │ 6 │ ├─────────────────────────────────────┼─────────┤ │ Private room in serviced apartment │ 6 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in condominium (condo) │ 6 │ ├─────────────────────────────────────┼─────────┤ │ Entire villa │ 5 │ ├─────────────────────────────────────┼─────────┤ │ Room in serviced apartment │ 4 │ ├─────────────────────────────────────┼─────────┤ │ Private room in villa │ 4 │ ├─────────────────────────────────────┼─────────┤ │ Camper/RV │ 3 │ ├─────────────────────────────────────┼─────────┤ │ Private room in resort │ 3 │ ├─────────────────────────────────────┼─────────┤ │ Casa particular │ 3 │ ├─────────────────────────────────────┼─────────┤ │ Private room in bungalow │ 3 │ ├─────────────────────────────────────┼─────────┤ │ Tiny home │ 2 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in loft │ 2 │ ├─────────────────────────────────────┼─────────┤ │ Private room in casa particular │ 2 │ ├─────────────────────────────────────┼─────────┤ │ Campsite │ 2 │ ├─────────────────────────────────────┼─────────┤ │ Floor │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in hotel │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Entire cottage │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Houseboat │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Tower │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in serviced apartment │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Boat │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Entire home/apt │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Barn │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in guest suite │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Shared room in boutique hotel │ 1 │ ├─────────────────────────────────────┼─────────┤ │ Tent │ 1 │ ╘═════════════════════════════════════╧═════════╛
The most frequent property types include "entire" in the description, initially indicating that you would have the unit to yourself.
This is a category for the listing that indicates:
It should be separated into two descriptors if possible. Part 2 is covered by Room_Type below.
con.execute("select distinct room_type from latest_listings;")
distrinct_room_types = con.fetchall()
distrinct_room_types
[('Private room',), ('Entire home/apt',), ('Shared room',), ('Hotel room',)]
This is a nice categorical in that there are few options:
This covers part 2 of the "property type" above.
Let's see the relationship between property_type and room_type
con.execute("select distinct room_type, property_type, count(property_type) as counts from latest_listings group by room_type, property_type order by room_type, counts desc;")
roomtype_proptype_counts = con.fetchall()
roomtype_proptype_counts_df = pd.DataFrame(roomtype_proptype_counts, columns=['room_type', 'property_type', 'property_type_counts'])
print(tabulate(roomtype_proptype_counts, headers=['room_type', 'property_type', 'counts'], tablefmt='pretty'))
+-----------------+-------------------------------------+--------+ | room_type | property_type | counts | +-----------------+-------------------------------------+--------+ | Entire home/apt | Entire rental unit | 3488 | | Entire home/apt | Entire serviced apartment | 889 | | Entire home/apt | Entire townhouse | 811 | | Entire home/apt | Entire home | 739 | | Entire home/apt | Entire guest suite | 632 | | Entire home/apt | Entire condo | 595 | | Entire home/apt | Entire residential home | 307 | | Entire home/apt | Entire condominium (condo) | 300 | | Entire home/apt | Entire guesthouse | 83 | | Entire home/apt | Entire loft | 51 | | Entire home/apt | Entire vacation home | 17 | | Entire home/apt | Room in aparthotel | 16 | | Entire home/apt | Entire place | 13 | | Entire home/apt | Entire bungalow | 11 | | Entire home/apt | Entire villa | 5 | | Entire home/apt | Camper/RV | 3 | | Entire home/apt | Casa particular | 3 | | Entire home/apt | Tiny home | 2 | | Entire home/apt | Campsite | 2 | | Entire home/apt | Floor | 1 | | Entire home/apt | Entire cottage | 1 | | Entire home/apt | Houseboat | 1 | | Entire home/apt | Tower | 1 | | Entire home/apt | Boat | 1 | | Entire home/apt | Entire home/apt | 1 | | Entire home/apt | Barn | 1 | | Entire home/apt | Tent | 1 | | Hotel room | Room in hostel | 12 | | Hotel room | Room in bed and breakfast | 7 | | Hotel room | Room in serviced apartment | 4 | | Hotel room | Room in boutique hotel | 4 | | Hotel room | Room in hotel | 1 | | Private room | Private room in home | 606 | | Private room | Private room in townhouse | 522 | | Private room | Private room in rental unit | 449 | | Private room | Private room in residential home | 294 | | Private room | Room in boutique hotel | 142 | | Private room | Room in hotel | 91 | | Private room | Private room in condo | 73 | | Private room | Private room in condominium (condo) | 71 | | Private room | Private room in guest suite | 51 | | Private room | Private room in bed and breakfast | 41 | | Private room | Private room in loft | 18 | | Private room | Private room in guesthouse | 10 | | Private room | Private room in hostel | 7 | | Private room | Private room | 6 | | Private room | Private room in serviced apartment | 6 | | Private room | Private room in villa | 4 | | Private room | Private room in resort | 3 | | Private room | Private room in bungalow | 3 | | Private room | Private room in casa particular | 2 | | Shared room | Shared room in rental unit | 70 | | Shared room | Shared room in townhouse | 29 | | Shared room | Shared room in home | 17 | | Shared room | Shared room in hostel | 17 | | Shared room | Shared room in residential home | 7 | | Shared room | Shared room in guesthouse | 6 | | Shared room | Shared room in condominium (condo) | 6 | | Shared room | Shared room in loft | 2 | | Shared room | Shared room in hotel | 1 | | Shared room | Shared room in serviced apartment | 1 | | Shared room | Shared room in guest suite | 1 | | Shared room | Shared room in boutique hotel | 1 | +-----------------+-------------------------------------+--------+
Each of the property types maps to a room type, and it looks like most of the listings in the Washington DC area are for Entire homes/apartments.
We'll plot the counts for each of these four types below:
room_type_list = listings_df['room_type'].unique()
room_type_list
room_type_count = listings_df['room_type'].sort_index().value_counts()
print(room_type_count)
figure = plt.figure()
ax1 = figure.add_subplot(1,1,1)
ax1.bar(room_type_count.index, room_type_count, color=THEME)
ax1.set_xlabel("Room Type")
ax1.set_ylabel("Count")
ax1.set_title("Number of Listings by Room Type")
ax1.set_ylim(0,8000)
ax1.yaxis.grid( visible=True, which="major")
ax1.set_axisbelow(True)
Entire home/apt 7975 Private room 2399 Shared room 158 Hotel room 28 Name: room_type, dtype: int64
Of the ~10,500 listings, almost 8000 were 'Entire home/apt', followed by 'Private room', then very few for 'Shared room' and 'Hotel room'.
This describes the maximum capacity for the unit.
First, we'll look at the distribution.
accommodates_count = listings_df['accommodates'].value_counts().sort_index()
accommodates_count = pd.DataFrame(accommodates_count)
print(tabulate(accommodates_count, headers=['Accomodates', 'Num of Listings'], tablefmt='pretty'))
figure = plt.figure()
ax1 = figure.add_subplot(1,1,1)
ax1.hist(listings_df['accommodates'],density=True)
ax1.set_xlabel('Accomodates')
ax1.set_ylabel('Counts')
ax1.set_title('Accomodates distribution')
+-------------+-----------------+ | Accomodates | Num of Listings | +-------------+-----------------+ | 0 | 5 | | 1 | 1007 | | 2 | 3319 | | 3 | 1487 | | 4 | 2262 | | 5 | 721 | | 6 | 899 | | 7 | 163 | | 8 | 348 | | 9 | 39 | | 10 | 129 | | 11 | 16 | | 12 | 81 | | 13 | 12 | | 14 | 16 | | 15 | 7 | | 16 | 49 | +-------------+-----------------+
Text(0.5, 1.0, 'Accomodates distribution')
More than a quarter of the listings are from the two to four range. There's an oddity where the a property accommodates 0.
listings_df[listings_df.accommodates == 0]
| property_type | room_type | accommodates | bathrooms | bathrooms_text | bedrooms | beds | amenities | price | |
|---|---|---|---|---|---|---|---|---|---|
| 3155 | Room in hostel | Hotel room | 0 | None | None | NaN | NaN | ["Long term stays allowed", "Smoke alarm", "Fi... | 0.0 |
| 3273 | Room in hostel | Hotel room | 0 | None | None | NaN | NaN | ["Long term stays allowed", "Dryer", "Luggage ... | 0.0 |
| 3589 | Room in boutique hotel | Hotel room | 0 | None | None | NaN | NaN | ["43\" HDTV with Chromecast", "Long term stays... | 0.0 |
| 6651 | Room in boutique hotel | Hotel room | 0 | None | None | NaN | NaN | ["Cooked-to-order breakfast available \u2014 $... | 0.0 |
| 9869 | Room in boutique hotel | Hotel room | 0 | None | None | NaN | NaN | ["Heating", "Concierge", "Toiletries", "Fire e... | 0.0 |
Looking at these listings, if accommodates is zero, there is no further information about the listing- The number of bathrooms, bedrooms, and beds are all blank, and the price is 0. This is unreliable, and cannot be used in prediction.
This condition can be used to filter out data that cannot be used in a model.
item_count = listings_df['bedrooms'].value_counts().sort_index()
item_count = pd.DataFrame(item_count)
print(tabulate(item_count, headers=['Bedrooms', 'Num of Listings'], tablefmt='pretty'))
figure = plt.figure()
ax1 = figure.add_subplot(1,1,1)
ax1.hist(listings_df['bedrooms'],density=True, bins=20)
ax1.set_xlabel('Bedrooms')
ax1.set_ylabel('Density')
ax1.set_title('Bedrooms distribution')
+----------+-----------------+ | Bedrooms | Num of Listings | +----------+-----------------+ | 1.0 | 6225 | | 2.0 | 2089 | | 3.0 | 822 | | 4.0 | 324 | | 5.0 | 90 | | 6.0 | 34 | | 7.0 | 10 | | 8.0 | 1 | | 9.0 | 3 | | 11.0 | 1 | | 15.0 | 1 | +----------+-----------------+
Text(0.5, 1.0, 'Bedrooms distribution')
Roughly 80% of the listings are for 1-2 bedrooms.
item_count = listings_df['beds'].value_counts().sort_index()
item_count = pd.DataFrame(item_count)
print(tabulate(item_count, headers=['beds', 'Num of Listings'], tablefmt='pretty'))
figure = plt.figure()
ax1 = figure.add_subplot(1,1,1)
ax1.hist(listings_df['beds'],density=True, bins=range(1,50,1))
ax1.set_xlabel('beds')
ax1.set_ylabel('Density')
ax1.set_title('Beds distribution')
+------+-----------------+ | beds | Num of Listings | +------+-----------------+ | 1.0 | 5177 | | 2.0 | 2953 | | 3.0 | 1124 | | 4.0 | 512 | | 5.0 | 245 | | 6.0 | 148 | | 7.0 | 48 | | 8.0 | 38 | | 9.0 | 16 | | 10.0 | 9 | | 11.0 | 6 | | 12.0 | 10 | | 16.0 | 2 | | 18.0 | 1 | | 20.0 | 1 | | 50.0 | 1 | +------+-----------------+
Text(0.5, 1.0, 'Beds distribution')
The vast majority are for 1-2 beds, with a surprising maximum at 50.
listings_df.amenities.head()
0 ["Dryer", "Smoke alarm", "Cooking basics", "In... 1 ["Smoke alarm", "Cooking basics", "Bed linens"... 2 ["Dryer", "Smoke alarm", "Single level home", ... 3 ["Dryer", "TV with standard cable", "Cooking b... 4 ["Dryer", "Smoke alarm", "Paid parking on prem... Name: amenities, dtype: object
We want to investigate the reviews to price of airbnbs. For reviews we have many different factors that a guest can rate a host on. There is the overall review score and other sub-scores. The assumption is that a higher review score means the host can charge a higher price.
The variables we are interested in from the all listings table are reviews.
The data dictionary does not list the details on this variable. We'll have to explore what it might be and figure it our from there.
latest_listings.review_scores_rating.describe()
count 7740.000000 mean 4.680700 std 0.636749 min 0.000000 25% 4.660000 50% 4.850000 75% 4.970000 max 5.000000 Name: review_scores_rating, dtype: float64
There are 21844 review score listings so some of our listings don't have a review score listed. That is interesting. We may have to cull those null rows from our data to match the sizes of price and review scores. Most of the reviews are 4s with a few differences .637 std.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_rating,bins=5, color="darkslategray")
axes.set_title("Review scores distribution")
axes.set_xlabel("review scores")
plt.show()
plt.close()
We have a few zeros but mostly 4's and 5's. Not all that interesting.
There is no entry for this in the data dictionary but I think it is basically the review score that measures how accurate the description of the listing was to the actual property that the guest was staying in. This is actually fairly important because it alerts someone if you are being shown fake images of a listing that is actually not the same as the one you are staying in.
latest_listings.review_scores_accuracy.describe()
count 7656.000000 mean 4.792516 std 0.397878 min 0.000000 25% 4.760000 50% 4.910000 75% 5.000000 max 5.000000 Name: review_scores_accuracy, dtype: float64
First thing I notice is the number of reviews is different for this one as well with only 21649 compared to 28000+ listings. It's also lower than our previous variable scores_rating. It's basically the same distribution of scores, mostly high 4s and 5s.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_accuracy,bins=5, color="darkslategray")
axes.set_title("Review scores distribution")
axes.set_xlabel("review scores")
plt.show()
plt.close()
There is no entry for this in the data dictionary but I think it is basically the review score that measures how clean the property was when the guest stayed there.
latest_listings.review_scores_cleanliness.describe()
count 7657.000000 mean 4.734177 std 0.420437 min 0.000000 25% 4.670000 50% 4.870000 75% 5.000000 max 5.000000 Name: review_scores_cleanliness, dtype: float64
Again there is a different number of total reviews than the other review variables and total. It's basically the same distribution of scores, mostly high 4s and 5s.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_cleanliness,bins=5, color="darkslategray")
axes.set_title("Review scores distribution")
axes.set_xlabel("review scores")
plt.show()
plt.close()
There is no entry for this in the data dictionary but I think it is basically the review score that measures how easy and simple the check in process was for the property.
latest_listings.review_scores_checkin.describe()
count 7654.000000 mean 4.849888 std 0.359567 min 0.000000 25% 4.860000 50% 4.960000 75% 5.000000 max 5.000000 Name: review_scores_checkin, dtype: float64
Again there is a different number of total reviews than the other review variables and total. It's basically the same distribution of scores, mostly high 4s and 5s.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_checkin,bins=5, color="darkslategray")
axes.set_title("Checkin scores distribution")
axes.set_xlabel("Scores")
plt.show()
plt.close()
There is no entry for this in the data dictionary but I think it is basically the review score that measures how communicative the host was during the whole stay, from responding to questions pre-checkin and any thing else during the stay at the property.
latest_listings.review_scores_communication.describe()
count 7657.000000 mean 4.838004 std 0.378617 min 0.000000 25% 4.840000 50% 4.960000 75% 5.000000 max 5.000000 Name: review_scores_communication, dtype: float64
Again there is a different number of total reviews, but this one matches with the number who left a review for cleaning. It's basically the same distribution of scores, mostly high 4s and 5s.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_communication,bins=5, color="darkslategray")
axes.set_title("communication scores distribution")
axes.set_xlabel("Scores")
plt.show()
plt.close()
There is no entry for this in the data dictionary but I think it is basically the review score that measures the location and neighborhood the listing is in. This could be useful for our crime data comparison.
latest_listings.review_scores_location.describe()
count 7655.000000 mean 4.780760 std 0.372359 min 0.000000 25% 4.715000 50% 4.900000 75% 5.000000 max 5.000000 Name: review_scores_location, dtype: float64
There is a different number of total reviews than the other review variables and total. It's basically the same distribution of scores, mostly high 4s and 5s.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_location,bins=5, color="darkslategray")
axes.set_title("Location scores distribution")
axes.set_xlabel("scores")
plt.show()
plt.close()
There is no entry for this in the data dictionary but I think it is basically the review score that measures the "bang for you buck" metric. Such as how affordable the space was, did the price per night match up well with the amenities and other things offered by the host.
latest_listings.review_scores_value.describe()
count 7652.000000 mean 4.697402 std 0.427266 min 0.000000 25% 4.640000 50% 4.800000 75% 4.920000 max 5.000000 Name: review_scores_value, dtype: float64
There is a different number of total reviews than the other review variables and total. It's basically the same distribution of scores, mostly high 4s and 5s.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.review_scores_value,bins=5, color="darkslategray")
axes.set_title("Value scores distribution")
axes.set_xlabel("scores")
plt.show()
plt.close()
The number of reviews the listing has.
latest_listings.number_of_reviews.describe()
count 10560.000000 mean 39.426136 std 74.913022 min 0.000000 25% 0.000000 50% 8.000000 75% 42.000000 max 875.000000 Name: number_of_reviews, dtype: float64
The mean is 44 reviews per listing with an std of 77, this is very skewed. It likely means there are a lot of properties that have been on airbnb for a long time, additionally, not many people leave reviews, I think the data estimates less than 30% leave a review after their stay. That would account for 1/4th having only 1 review and the median being 10 reviews.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.number_of_reviews,bins=10, color="darkslategray")
axes.set_title("Number of reviews distribution")
axes.set_xlabel("reviews")
plt.show()
plt.close()
Let's look at just a range between 0-20 total reviews.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.number_of_reviews,range=[0,20],bins=20, color="darkslategray")
axes.set_title("Number of reviews distribution")
axes.set_xlabel("reviews")
plt.show()
plt.close()
This is interesting, it looks like a large number of properties have less than 2 total reviews. That's going to skew our data considerably. Let's see how many are zero reviews.
print(latest_listings[latest_listings["number_of_reviews"] == 0 ]["number_of_reviews"])
37 0
49 0
68 0
94 0
129 0
..
10552 0
10555 0
10557 0
10558 0
10559 0
Name: number_of_reviews, Length: 2820, dtype: int32
It seems that 22% of our listings have 0 reviews! That's very high.
The number of reviews the listing has (in the last 12 months). I'm hoping this matches our four quarters as measured.
latest_listings.number_of_reviews_ltm.describe()
count 10560.000000 mean 10.165814 std 19.074238 min 0.000000 25% 0.000000 50% 1.000000 75% 12.000000 max 341.000000 Name: number_of_reviews_ltm, dtype: float64
In the last 12 months, there were a mean number of 11 reviews for our listings. The median is 2 and the 3rd quartile is 15. There is a min of 0 reviews and a max of 341 reviews, that's almost one review per day for that listing!
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.number_of_reviews_ltm,bins=10, color="darkslategray")
axes.set_title("Number of reviews distribution")
axes.set_xlabel("reviews")
plt.show()
plt.close()
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.number_of_reviews_ltm,range=[0,15],bins=15, color="darkslategray")
axes.set_title("Number of reviews 12 months distribution")
axes.set_xlabel("reviews")
plt.show()
plt.close()
This variable is the number of reviews the listing has (in the last 30 days).
latest_listings.number_of_reviews_l30d.describe()
count 10560.000000 mean 0.768845 std 2.097824 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 119.000000 Name: number_of_reviews_l30d, dtype: float64
In the last 30 months, there were a mean number of less than 1 reviews for our listings. The median is 0 and the 3rd quartile is 1. There is a min of 0 reviews and a max of 119 reviews.
The number of reviews the listing has over the lifetime of the listing.
latest_listings.reviews_per_month.describe()
count 7740.000000 mean 1.602641 std 2.193940 min 0.010000 25% 0.240000 50% 0.940000 75% 2.340000 max 105.000000 Name: reviews_per_month, dtype: float64
In our listings there were a mean number of less than 1 reviews for the lifetime of listings. The median is 0 and the 3rd quartile is 2. There is a min of 0 reviews and a max of 105 reviews.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(latest_listings.reviews_per_month,range=[0,5],bins=15, color="darkslategray")
axes.set_title("Number of reviews distribution")
axes.set_xlabel("reviews")
plt.show()
plt.close()
The AirBNB data includes a large calendar data set, and we want to investigate the usefulness of it. One omission from this data set is the indicator of when a listing was actually booked. Having this indicator would allow us to determine actual booked listings, and possibly revenues.
The variables from the calendar table are:
* listing_id
* date
* available
* price
* adjusted_price
* minimum_nights
* maximum_nights
calendar_listings = con.execute('SELECT * from calendar').df()
calendar_listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10245531 entries, 0 to 10245530 Data columns (total 7 columns): # Column Dtype --- ------ ----- 0 listing_id int64 1 date datetime64[ns] 2 available object 3 price float64 4 adjusted_price float64 5 minimum_nights float64 6 maximum_nights float64 dtypes: datetime64[ns](1), float64(4), int64(1), object(1) memory usage: 547.2+ MB
The calendar table has 10 million rows. This is likely because we have an entry for each date a property is available on airbnb.
calendar_listings.head()
| listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights | |
|---|---|---|---|---|---|---|---|
| 0 | 3943 | 2021-12-15 | t | 70.0 | 70.0 | 2.0 | 1125.0 |
| 1 | 3943 | 2021-12-16 | t | 70.0 | 70.0 | 2.0 | 1125.0 |
| 2 | 3943 | 2021-12-17 | f | 70.0 | 70.0 | 2.0 | 1125.0 |
| 3 | 3943 | 2021-12-18 | f | 70.0 | 70.0 | 2.0 | 1125.0 |
| 4 | 3943 | 2021-12-19 | t | 70.0 | 70.0 | 2.0 | 1125.0 |
We have listing_id which is unique and for each data the the listing is available there is an entry for it in the database and a price. We want to see if there is any relationship between price.
We did a thorough investigation of the calendar data set, but did not find data that could be used to explain or model. The investigation can be found in the EDA notebook (eda.ipynb). For this report, we will omit the analysis.
The crime data includes location data, crime type, and datetime information. Each row is a single crime.
con.execute('select * from crimes;')
crime_data = con.fetch_df()
print(crime_data)
column00 NEIGHBORHOOD_CLUSTER offensegroup LONGITUDE \
0 0 cluster 25 violent -77.002778
1 1 cluster 26 violent -76.997326
2 2 cluster 35 violent -76.964260
3 3 cluster 39 violent -76.999896
4 4 cluster 2 property -77.025347
... ... ... ... ...
27606 27606 cluster 1 property -77.042324
27607 27607 cluster 1 property -77.042369
27608 27608 cluster 26 property -76.994963
27609 27609 cluster 11 property -77.081165
27610 27610 cluster 25 property -77.001314
END_DATE offense-text SHIFT \
0 9/7/2021, 2:18:00 PM homicide midnight
1 10/3/2021, 3:20:00 PM robbery evening
2 10/3/2021, 6:40:00 PM robbery evening
3 10/11/2021, 8:52:00 PM assault w/dangerous weapon midnight
4 10/2/2021, 11:01:00 PM theft f/auto evening
... ... ... ...
27606 7/21/2022, 7:00:00 PM theft/other midnight
27607 8/12/2022, 6:30:00 AM theft/other midnight
27608 8/12/2022, 9:00:00 AM theft f/auto midnight
27609 5/31/2022, 7:00:00 AM theft f/auto day
27610 5/31/2022, 11:37:00 AM theft f/auto day
offensekey START_DATE \
0 violent|homicide 2021-09-07
1 violent|robbery 2021-10-03
2 violent|robbery 2021-10-03
3 violent|assault w/dangerous weapon 2021-10-11
4 property|theft f/auto 2021-10-02
... ... ...
27606 property|theft/other 2022-07-21
27607 property|theft/other 2022-08-11
27608 property|theft f/auto 2022-08-12
27609 property|theft f/auto 2022-05-26
27610 property|theft f/auto 2022-05-27
OFFENSE METHOD LATITUDE \
0 homicide others 38.897331
1 robbery others 38.881269
2 robbery gun 38.858661
3 assault w/dangerous weapon others 38.831632
4 theft f/auto others 38.922290
... ... ... ...
27606 theft/other others 38.926303
27607 theft/other others 38.924303
27608 theft f/auto others 38.891322
27609 theft f/auto others 38.955947
27610 theft f/auto others 38.898907
NEIGHBORHOOD_NAME
0 Union Station, Stanton Park, Kingman Park
1 Capitol Hill, Lincoln Park
2 Fairfax Village, Naylor Gardens, Hillcrest, Su...
3 Congress Heights, Bellevue, Washington Highlands
4 Columbia Heights, Mt. Pleasant, Pleasant Plain...
... ...
27606 Kalorama Heights, Adams Morgan, Lanier Heights
27607 Kalorama Heights, Adams Morgan, Lanier Heights
27608 Capitol Hill, Lincoln Park
27609 Friendship Heights, American University Park, ...
27610 Union Station, Stanton Park, Kingman Park
[27611 rows x 13 columns]
Neighborhood cluster represents which neighbhor hood the crime happened. There are 45 clusters in this column and they can be matched to a specific neighbhor hood. The most frequent one is cluster 25. This value tied close to the neighbor hood name column. This is a categorical variable.
crime_data.NEIGHBORHOOD_CLUSTER.describe()
count 27609 unique 45 top cluster 25 freq 2070 Name: NEIGHBORHOOD_CLUSTER, dtype: object
crime_data.NEIGHBORHOOD_CLUSTER.value_counts()
cluster 25 2070 cluster 8 2055 cluster 2 1788 cluster 3 1598 cluster 23 1335 cluster 6 1276 cluster 18 1270 cluster 7 1125 cluster 21 1115 cluster 26 1095 cluster 22 996 cluster 39 990 cluster 34 752 cluster 17 747 cluster 1 729 cluster 33 693 cluster 31 667 cluster 27 592 cluster 4 566 cluster 32 561 cluster 9 501 cluster 19 488 cluster 24 452 cluster 38 367 cluster 11 365 cluster 30 359 cluster 15 329 cluster 20 319 cluster 35 315 cluster 5 301 cluster 14 292 cluster 37 266 cluster 28 225 cluster 12 219 cluster 36 207 cluster 10 188 cluster 16 148 cluster 13 91 cluster 45 48 cluster 29 44 cluster 43 32 cluster 40 15 cluster 44 7 cluster 46 7 cluster 41 4 Name: NEIGHBORHOOD_CLUSTER, dtype: int64
There are two types of offense crime in this variable, Property and violent, so it's a categorical variable. We can see from the summary that majority of the offense is property.
crime_data.offensegroup.value_counts()
property 23567 violent 4044 Name: offensegroup, dtype: int64
This is a pretty straightforward variable. It represents the longtitude of the crime. Longtitude is a continuous numerical data and let's a take a look at the summary of it.
crime_data.LONGITUDE.describe()
count 27611.000000 mean -77.007869 std 0.034340 min -77.112316 25% -77.030787 50% -77.010666 75% -76.986919 max -76.910021 Name: LONGITUDE, dtype: float64
The min here is -77.1123 and max is -76.9100. Since our crime data is based on DC area, so this make perfect sense. Now let's take a look at the distribution of it.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(crime_data.LONGITUDE, color="darkslategray")
axes.set_title("Distribution of Longtitude")
axes.set_xlabel("Longtitude")
plt.show()
plt.close()
Here the distribution looks very nearly normal. 50% of the crimes has the longtitude between -77.050 to -76.975.
crime_data.END_DATE.describe()
count 25411 unique 22106 top 5/25/2022, 7:00:00 AM freq 7 Name: END_DATE, dtype: object
This is another variable to describle the type of crime. It's a categorical variable. Let's take a look at the summary of it.
crime_data['offense-text'].describe()
count 27611 unique 9 top theft/other freq 10783 Name: offense-text, dtype: object
crime_data['offense-text'].value_counts()
theft/other 10783 theft f/auto 8077 motor vehicle theft 3535 robbery 2167 assault w/dangerous weapon 1499 burglary 1168 homicide 221 sex abuse 157 arson 4 Name: offense-text, dtype: int64
We can see there are 9 different types of offsense. Most of them are theft/other and theft/auto which aligned well with our offense group variable.
crime_data.SHIFT.value_counts()
evening 11109 day 10863 midnight 5639 Name: SHIFT, dtype: int64
We can see that majority of the crimes happened during evening and daytime.
Offense key is another variable describe the type of crimes. It's a aggregation of offensegroup and offense-text. It's also a categorical variable.
crime_data.offensekey.describe()
count 27611 unique 9 top property|theft/other freq 10783 Name: offensekey, dtype: object
crime_data.offensekey.value_counts()
property|theft/other 10783 property|theft f/auto 8077 property|motor vehicle theft 3535 violent|robbery 2167 violent|assault w/dangerous weapon 1499 property|burglary 1168 violent|homicide 221 violent|sex abuse 157 property|arson 4 Name: offensekey, dtype: int64
We can see that there are 9 unique types and the top 3 are property related crimes which are theft/other, theft/auto and motor vehicle theft.
Similar to end date, this variable represents the time and date when the crime start. It's a date type.
crime_data.START_DATE.describe()
count 27602 unique 459 top 2021-10-06 00:00:00 freq 123 first 2010-05-28 00:00:00 last 2022-09-29 00:00:00 Name: START_DATE, dtype: object
crime_data.OFFENSE.describe()
count 27611 unique 9 top theft/other freq 10783 Name: OFFENSE, dtype: object
crime_data.OFFENSE.value_counts()
theft/other 10783 theft f/auto 8077 motor vehicle theft 3535 robbery 2167 assault w/dangerous weapon 1499 burglary 1168 homicide 221 sex abuse 157 arson 4 Name: OFFENSE, dtype: int64
Similar to the results above for offense-text, there are 9 types of crimes.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(crime_data.OFFENSE, color="darkslategray")
axes.set_title("Distribution of offense tyoe")
axes.set_xlabel("offense")
plt.show()
plt.close()
crime_data.METHOD.describe()
count 27611 unique 3 top others freq 24756 Name: METHOD, dtype: object
crime_data.METHOD.value_counts()
others 24756 gun 2333 knife 522 Name: METHOD, dtype: int64
We can see there are 3 methods in it, others, gun and knife. Most of them are others.
crime_data.LATITUDE.describe()
count 27611.000000 mean 38.907542 std 0.029605 min 38.814659 25% 38.894325 50% 38.907235 75% 38.923862 max 38.994901 Name: LATITUDE, dtype: float64
The min here is 38.8147 and max is 38.9949. Since our crime data is based on DC area, so this make perfect sense. Now let's take a look at the distribution of it.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.hist(crime_data.LATITUDE, color="darkslategray")
axes.set_title("Distribution of Latitude")
axes.set_xlabel("Latitude")
plt.show()
plt.close()
Here the distribution looks very nearly normal. 50% of the longtitude has the latitude between 38.875 to 38.925.
This is a new variable we added to the dataset. This is the actual name for the neighborhood cluster. Let's take a look at the summary.
crime_data.NEIGHBORHOOD_NAME.describe()
count 27611 unique 46 top Union Station, Stanton Park, Kingman Park freq 2070 Name: NEIGHBORHOOD_NAME, dtype: object
crime_data.NEIGHBORHOOD_NAME.value_counts()
Union Station, Stanton Park, Kingman Park 2070 Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street 2055 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View 1788 Howard University, Le Droit Park, Cardozo/Shaw 1598 Ivy City, Arboretum, Trinidad, Carver Langston 1335 Dupont Circle, Connecticut Avenue/K Street 1276 Brightwood Park, Crestwood, Petworth 1270 Shaw, Logan Circle 1125 Edgewood, Bloomingdale, Truxton Circle, Eckington 1115 Capitol Hill, Lincoln Park 1095 Brookland, Brentwood, Langdon 996 Congress Heights, Bellevue, Washington Highlands 990 Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont 752 Takoma, Brightwood, Manor Park 747 Kalorama Heights, Adams Morgan, Lanier Heights 729 Capitol View, Marshall Heights, Benning Heights 693 Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights 667 Near Southeast, Navy Yard 592 Georgetown, Burleith/Hillandale 566 River Terrace, Benning, Greenway, Dupont Park 561 Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point 501 Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill 488 Woodridge, Fort Lincoln, Gateway 452 Douglas, Shipley Terrace 367 Friendship Heights, American University Park, Tenleytown 365 Mayfair, Hillbrook, Mahaning Heights 359 Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace 329 North Michigan Park, Michigan Park, University Heights 319 Fairfax Village, Naylor Gardens, Hillcrest, Summit Park 315 West End, Foggy Bottom, GWU 301 Cathedral Heights, McLean Gardens, Glover Park 292 Sheridan, Barry Farm, Buena Vista 266 Historic Anacostia 225 North Cleveland Park, Forest Hills, Van Ness 219 Woodland/Fort Stanton, Garfield Heights, Knox Hill 207 Hawthorne, Barnaby Woods, Chevy Chase 188 Colonial Village, Shepherd Park, North Portal Estates 148 Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir 91 National Mall, Potomac River 48 Eastland Gardens, Kenilworth 44 Saint Elizabeths 32 Walter Reed 15 Joint Base Anacostia-Bolling 7 Arboretum, Anacostia River 7 Rock Creek Park 4 N/A 2 Name: NEIGHBORHOOD_NAME, dtype: int64
Let's look at the pairwise analysis of neighborhood and price. I expect AirBnB price to track general rent cost of the D.C. areas. Neighborhoods with higher rent will have a higher nightly AirBnB price on average and the reverse would be true for areas with low rent. From general domain knowledge about Washington D.C., I know that the downtown area and the Northwest quadrant have higher rents on average than the Northeast and Southeast quadrant. Therefore I expect AirBnB nightly prices to be higher on average in the downtown area and the Northwest quadrant.
One thing we will need to be mindful of in doing this analysis is the comparison of counts of AirBnB listings in each neighborhood cluster. Some clusters might have a high density of AirBnBs and others will have much fewer. Let's get the counts in tabular form as well as plotting them on the D.C. map. First, we must remove duplicate listings.
# Code from Lab 5 Solution
def describe_by_category(data, numeric, categorical, transpose=False):
grouped = data.groupby(categorical)
grouped_y = grouped[numeric].describe()
if transpose:
print(grouped_y.transpose())
else:
print(grouped_y)
# Code from Module 5
def classify_correlation(r):
r = abs(r)
if r < 0.16:
return "very weak"
if r < 0.29:
return "weak"
if r < 0.49:
return "low"
if r < 0.69:
return "moderate"
if r < 0.89:
return "strong"
return "very strong"
# Code from Module 5
def correlation(data, x, y):
print("Correlation coefficients:")
r = stats.pearsonr(data[x], data[y])[0]
print( "r =", r, f"({classify_correlation(r)})")
rho = stats.spearmanr(data[x], data[y])[0]
print( "rho =", rho, f"({classify_correlation(rho)})")
# Code from Module 5
def lowess_scatter(data, x, y, jitter=0.0, skip_lowess=False):
if skip_lowess:
fit = np.polyfit(data[x], data[y], 1)
line_x = np.linspace(data[x].min(), data[x].max(), 10)
line = np.poly1d(fit)
line_y = list(map(line, line_x))
else:
lowess = sm.nonparametric.lowess(data[y], data[x], frac=.3)
line_x = list(zip(*lowess))[0]
line_y = list(zip(*lowess))[1]
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
xs = data[x]
if jitter > 0.0:
xs = data[x] + stats.norm.rvs( 0, 0.5, data[x].size)
axes.scatter(xs, data[y], marker="o", color="DimGray", alpha=0.5)
axes.plot(line_x, line_y, color="DarkRed")
title = "Plot of {0} v. {1}".format(x, y)
if not skip_lowess:
title += " with LOWESS"
axes.set_title(title)
axes.set_xlabel(x)
axes.set_ylabel(y)
plt.show()
plt.close()
# Code from Module 5
def restyle_boxplot(patch):
## change color and linewidth of the whiskers
for whisker in patch['whiskers']:
whisker.set(color='#000000', linewidth=1)
## change color and linewidth of the caps
for cap in patch['caps']:
cap.set(color='#000000', linewidth=1)
## change color and linewidth of the medians
for median in patch['medians']:
median.set(color='#000000', linewidth=2)
## change the style of fliers and their fill
for flier in patch['fliers']:
flier.set(marker='o', color='#000000', alpha=0.2)
for box in patch["boxes"]:
box.set( facecolor='#FFFFFF', alpha=0.5)
# Code from Module 5
def multiboxplot(data, numeric, categorical, skip_data_points=True):
figure = plt.figure(figsize=(30, 6))
axes = figure.add_subplot(1, 1, 1)
grouped = data.groupby(categorical)
labels = pd.unique(data[categorical].values)
labels.sort()
grouped_data = [grouped[numeric].get_group( k) for k in labels]
patch = axes.boxplot( grouped_data, labels=labels, patch_artist=True, zorder=1)
restyle_boxplot( patch)
if not skip_data_points:
for i, k in enumerate(labels):
subdata = grouped[numeric].get_group( k)
x = np.random.normal(i + 1, 0.01, size=len(subdata))
axes.plot(x, subdata, 'o', alpha=0.4, color="DimGray", zorder=2)
axes.set_xlabel(categorical)
axes.set_xticklabels(labels, rotation=90, ha='right')
axes.set_ylabel(numeric)
axes.set_title("Distribution of {0} by {1}".format(numeric, categorical))
plt.show()
plt.close()
neighborhood vs. price ¶We want to look at the interaction between neighborhood and price. Just like non-rental, non-hospitality real estate, AirBnB nightly price will vary by neighborhood. Nicer neighborhoods will generally have more expensive AirBnBs whereas poorer neighborhoods will generally have cheaper AirBnBs.
Other factors that contribute to the correlation between neighborhood and price is the amenity makeup of a neighborhood. Neighborhoods with more access to amenities and attractions will have more expensive AirBnBs. For this reason, I suspect that neighborhoods near the central business district (downtown) will have more expensive AirBnBs.
For the remainder of this analysis we will use only the latest listings from the most recent quarter.
latest_listings = con.execute("select * from latest_listings;").df()
latest_listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10560 entries, 0 to 10559 Data columns (total 74 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10560 non-null int64 1 listing_url 10560 non-null object 2 scrape_id 10560 non-null int64 3 last_scraped 10560 non-null datetime64[ns] 4 name 10559 non-null object 5 description 10416 non-null object 6 neighborhood_overview 6559 non-null object 7 picture_url 10560 non-null object 8 host_id 10560 non-null int32 9 host_url 10560 non-null object 10 host_name 10398 non-null object 11 host_since 10398 non-null datetime64[ns] 12 host_location 9548 non-null object 13 host_about 6013 non-null object 14 host_response_time 10398 non-null object 15 host_response_rate 7452 non-null float64 16 host_acceptance_rate 8120 non-null float64 17 host_is_superhost 10554 non-null object 18 host_thumbnail_url 10398 non-null object 19 host_picture_url 10398 non-null object 20 host_neighbourhood 9583 non-null object 21 host_listings_count 10398 non-null float64 22 host_total_listings_count 10398 non-null float64 23 host_verifications 10560 non-null object 24 host_has_profile_pic 10398 non-null object 25 host_identity_verified 10398 non-null object 26 neighbourhood 6559 non-null object 27 neighbourhood_cleansed 10560 non-null object 28 neighbourhood_group_cleansed 0 non-null float64 29 latitude 10560 non-null float64 30 longitude 10560 non-null float64 31 property_type 10560 non-null object 32 room_type 10560 non-null object 33 accommodates 10560 non-null int32 34 bathrooms 0 non-null float64 35 bathrooms_text 10546 non-null object 36 bedrooms 9600 non-null float64 37 beds 10291 non-null float64 38 amenities 10560 non-null object 39 price 10560 non-null float64 40 minimum_nights 10560 non-null int32 41 maximum_nights 10560 non-null int32 42 minimum_minimum_nights 10557 non-null float64 43 maximum_minimum_nights 10557 non-null float64 44 minimum_maximum_nights 10557 non-null float64 45 maximum_maximum_nights 10557 non-null float64 46 minimum_nights_avg_ntm 10557 non-null float64 47 maximum_nights_avg_ntm 10557 non-null float64 48 calendar_updated 0 non-null float64 49 has_availability 10560 non-null bool 50 availability_30 10560 non-null int32 51 availability_60 10560 non-null int32 52 availability_90 10560 non-null int32 53 availability_365 10560 non-null int32 54 calendar_last_scraped 10560 non-null datetime64[ns] 55 number_of_reviews 10560 non-null int32 56 number_of_reviews_ltm 10560 non-null int32 57 number_of_reviews_l30d 10560 non-null int32 58 first_review 7740 non-null datetime64[ns] 59 last_review 7740 non-null datetime64[ns] 60 review_scores_rating 7740 non-null float64 61 review_scores_accuracy 7656 non-null float64 62 review_scores_cleanliness 7657 non-null float64 63 review_scores_checkin 7654 non-null float64 64 review_scores_communication 7657 non-null float64 65 review_scores_location 7655 non-null float64 66 review_scores_value 7652 non-null float64 67 license 3427 non-null object 68 instant_bookable 10560 non-null bool 69 calculated_host_listings_count 10560 non-null int32 70 calculated_host_listings_count_entire_homes 10560 non-null int32 71 calculated_host_listings_count_private_rooms 10560 non-null int32 72 calculated_host_listings_count_shared_rooms 10560 non-null int32 73 reviews_per_month 7740 non-null float64 dtypes: bool(2), datetime64[ns](5), float64(26), int32(15), int64(2), object(24) memory usage: 5.2+ MB
latest_listings_gdf = gpd.GeoDataFrame(
latest_listings, geometry=gpd.points_from_xy(latest_listings.longitude, latest_listings.latitude))
Let's look at where the AirBnBs are on the neighborhood map.
type(latest_listings_gdf.geometry)
geopandas.geoseries.GeoSeries
fig,ax = plt.subplots(figsize = (15,15))
base = neighborhoods_gpd.plot(ax=ax)
for idx, row in neighborhoods_gpd.iterrows():
plt.annotate(row['alpha_index'], xy=row['rep_point'],
horizontalalignment='center')
latest_listings_gdf.geometry.plot(ax=base, marker='o', color='red', markersize=5);
plt.show()
It turns out there are listings that exist outside of the DC borders. There are listings in the Arlington are of Virginia and various parts of Maryland. The listing densities of each neighbrhood cluster vary immensely. The clusters close to the downtown/National Mall area have the highest density of AirBnBs. The density becomes smaller the further out you go from downtown. Let's look at the densities in tabular form:
describe_by_category(latest_listings, "price", "neighbourhood_cleansed", transpose=True)
neighbourhood_cleansed Brightwood Park, Crestwood, Petworth \ count 518.000000 mean 142.733591 std 161.809158 min 20.000000 25% 69.000000 50% 95.000000 75% 150.000000 max 1500.000000 neighbourhood_cleansed Brookland, Brentwood, Langdon \ count 152.000000 mean 130.921053 std 107.605087 min 29.000000 25% 70.000000 50% 99.000000 75% 150.000000 max 675.000000 neighbourhood_cleansed Capitol Hill, Lincoln Park \ count 858.000000 mean 230.157343 std 373.065396 min 10.000000 25% 100.000000 50% 142.000000 75% 219.750000 max 5995.000000 neighbourhood_cleansed Capitol View, Marshall Heights, Benning Heights \ count 144.000000 mean 165.534722 std 426.121276 min 26.000000 25% 59.750000 50% 83.500000 75% 165.500000 max 5000.000000 neighbourhood_cleansed Cathedral Heights, McLean Gardens, Glover Park \ count 129.000000 mean 370.658915 std 1202.157874 min 10.000000 25% 80.000000 50% 125.000000 75% 210.000000 max 9999.000000 neighbourhood_cleansed Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace \ count 99.000000 mean 201.939394 std 273.239180 min 28.000000 25% 81.500000 50% 110.000000 75% 168.500000 max 1900.000000 neighbourhood_cleansed Colonial Village, Shepherd Park, North Portal Estates \ count 199.000000 mean 144.356784 std 148.159545 min 26.000000 25% 77.500000 50% 119.000000 75% 190.500000 max 1895.000000 neighbourhood_cleansed Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View \ count 817.000000 mean 155.922889 std 144.659184 min 25.000000 25% 79.000000 50% 112.000000 75% 181.000000 max 1425.000000 neighbourhood_cleansed Congress Heights, Bellevue, Washington Highlands \ count 171.000000 mean 106.824561 std 109.729956 min 26.000000 25% 48.000000 50% 75.000000 75% 120.000000 max 1000.000000 neighbourhood_cleansed Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights \ count 145.000000 mean 124.220690 std 125.035799 min 22.000000 25% 56.000000 50% 80.000000 75% 130.000000 max 697.000000 neighbourhood_cleansed ... Shaw, Logan Circle \ count ... 652.000000 mean ... 223.815951 std ... 309.828697 min ... 0.000000 25% ... 107.750000 50% ... 156.000000 75% ... 234.000000 max ... 4357.000000 neighbourhood_cleansed Sheridan, Barry Farm, Buena Vista \ count 55.000000 mean 140.472727 std 120.289905 min 45.000000 25% 59.500000 50% 81.000000 75% 175.000000 max 499.000000 neighbourhood_cleansed Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point \ count 204.000000 mean 275.127451 std 302.317621 min 0.000000 25% 113.000000 50% 200.000000 75% 291.000000 max 2499.000000 neighbourhood_cleansed Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir \ count 227.000000 mean 248.885463 std 407.112664 min 29.000000 25% 109.000000 50% 165.000000 75% 202.000000 max 5000.000000 neighbourhood_cleansed Takoma, Brightwood, Manor Park \ count 258.000000 mean 161.151163 std 625.179486 min 28.000000 25% 59.000000 50% 88.500000 75% 150.000000 max 10000.000000 neighbourhood_cleansed Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont \ count 130.000000 mean 166.392308 std 204.495238 min 20.000000 25% 65.250000 50% 98.500000 75% 198.250000 max 1500.000000 neighbourhood_cleansed Union Station, Stanton Park, Kingman Park \ count 869.000000 mean 213.681243 std 310.728260 min 29.000000 25% 96.000000 50% 140.000000 75% 221.000000 max 5000.000000 neighbourhood_cleansed West End, Foggy Bottom, GWU \ count 326.000000 mean 273.766871 std 702.041647 min 40.000000 25% 97.250000 50% 165.000000 75% 225.750000 max 10000.000000 neighbourhood_cleansed Woodland/Fort Stanton, Garfield Heights, Knox Hill \ count 15.000000 mean 148.666667 std 92.913682 min 30.000000 25% 72.000000 50% 130.000000 75% 214.500000 max 313.000000 neighbourhood_cleansed Woodridge, Fort Lincoln, Gateway count 85.000000 mean 141.282353 std 148.386449 min 25.000000 25% 64.000000 50% 95.000000 75% 150.000000 max 900.000000 [8 rows x 39 columns]
This data is pretty hard to consume but we can see a stark difference in the count of AirBnBs per neighborhood cluster. Let's plot the average price for each neighborhood using box plots to get a better feel for the comparative price
latest_listings_below_1500 = latest_listings[latest_listings.price < 1500]
latest_listings_below_500 = latest_listings[latest_listings.price < 500]
multiboxplot(latest_listings_below_1500, "price", "neighbourhood_cleansed")
From this plot, we can see that no neighborhood has a median nightly cost greater than ~200 USD. Some notably expensive neighborhoods include the Downtown Chinatown Penn Quarters are, Shaw Logan Circle, Georgetown, the Southwest Emplyment Area, Union Station, and West End. From my own experience living in D.C., all of these areas are near downtown, have good amenities, and are nicer neighborhoods than most.
I am now going to plot a heatmap of the average price of AirBnBs be neighborhood for (1) listings below 500 USD, (2) listings below 1500 USD, and (3) all listings.
grouped = latest_listings_below_500.groupby("neighbourhood_cleansed")
labels = pd.unique(latest_listings_below_500["neighbourhood_cleansed"].values)
labels.sort()
grouped_data = [grouped["price"].get_group( k) for k in labels]
grouped_y = grouped['price'].describe()
avg_price_per_neighborhood = list(grouped_y.iloc[:, 1])
neighborhoods_gpd['avg_price'] = avg_price_per_neighborhood
fig,ax = plt.subplots(figsize = (15,15))
base = neighborhoods_gpd.plot(ax=ax, column="avg_price",legend=True)
for idx, row in neighborhoods_gpd.iterrows():
plt.annotate(row['alpha_index'], xy=row['rep_point'],
horizontalalignment='center')
grouped = latest_listings_below_1500.groupby("neighbourhood_cleansed")
labels = pd.unique(latest_listings_below_1500["neighbourhood_cleansed"].values)
labels.sort()
grouped_data = [grouped["price"].get_group( k) for k in labels]
grouped_y = grouped['price'].describe()
avg_price_per_neighborhood = list(grouped_y.iloc[:, 1])
neighborhoods_gpd['avg_price'] = avg_price_per_neighborhood
fig,ax = plt.subplots(figsize = (15,15))
base = neighborhoods_gpd.plot(ax=ax, column="avg_price",legend=True)
for idx, row in neighborhoods_gpd.iterrows():
plt.annotate(row['alpha_index'], xy=row['rep_point'],
horizontalalignment='center')
grouped = latest_listings.groupby("neighbourhood_cleansed")
labels = pd.unique(latest_listings["neighbourhood_cleansed"].values)
labels.sort()
grouped_data = [grouped["price"].get_group( k) for k in labels]
grouped_y = grouped['price'].describe()
avg_price_per_neighborhood = list(grouped_y.iloc[:, 1])
neighborhoods_gpd['avg_price'] = avg_price_per_neighborhood
fig,ax = plt.subplots(figsize = (15,15))
base = neighborhoods_gpd.plot(ax=ax, column="avg_price",legend=True)
for idx, row in neighborhoods_gpd.iterrows():
plt.annotate(row['alpha_index'], xy=row['rep_point'],
horizontalalignment='center')
Clearly, the outliers in the all_listings table skew the coloring of the map. However, all maps align with what was predicted in that areas near downtown and/or the NW quadrant of the city (especially near the Potomac) are most expensive. Areas in the NE and SE quadrant, especially areas further away radially from Downtown are less expensive.
latest_listings["years_host"] = (res - latest_listings.host_since) / np.timedelta64(1,'Y')
latest_listings['price'].isnull().values.any()
False
We can definitely see that price varies by neighborhood. Northwest DC and the downtown area are more expensive while northeast DC and areas further away from Downtown are cheaper. We expect that neighborhood would be useful in a regression however with there being so many categories, we also expect that we could extract the same relevant information as it relates to price by using other location-based variables like latitude and longitude.
host_since vs price ¶We are now going to look at the relationship between how long someone has been a host and the price of their listings. I don't expect there to be any kind of strong correlation between the two.
latest_listings_non_null_host_since = latest_listings[latest_listings["years_host"].notnull()]
correlation(latest_listings_non_null_host_since, "price", "years_host")
Correlation coefficients: r = -0.003247102696335248 (very weak) rho = -0.08698568268122552 (very weak)
As I expected, there is almost zero correlation. Let's look at the scatterplot
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter( latest_listings_non_null_host_since.years_host, latest_listings_non_null_host_since.price, marker="o", color="darkslategray")
axes.set_ylabel("price")
axes.set_xlabel("years_host")
axes.set_title("Scatter Plot of price vs. years_host")
plt.show()
plt.close()
This plot shows the average price is almost a straight line across all years being a host. The outliers making it hard to see detail. Let's zoom in:
latest_listings_below_500["years_host"] = (res - latest_listings_below_500.host_since) / np.timedelta64(1,'Y')
latest_listings_non_null_host_since_below_500 = latest_listings_below_500[latest_listings_below_500["years_host"].notnull()]
correlation(latest_listings_non_null_host_since_below_500, "price", "years_host")
Correlation coefficients: r = -0.10565400632052657 (very weak) rho = -0.11097828442186954 (very weak)
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter( latest_listings_non_null_host_since_below_500.years_host, latest_listings_non_null_host_since_below_500.price, marker="o", color="darkslategray")
axes.set_ylabel("price")
axes.set_xlabel("years_host")
axes.set_title("Scatter Plot of price vs. years_host")
plt.show()
plt.close()
Even when we zoom in, there is clearly no correlation between year_host and price.
host_location vs price ¶This one won't work because the data is ambiguous in category. There are several categories meaning the same thing like "DC" and "Washington DC" or "United States" and "US" all being separate categories.
host_response_time vs price ¶I don't expect there to be a direct, causal relationship between these two variables. However, higher host response times likely correspond to a higher rating and overall stay experience which would in turn give the host leverage to raise the price of their AirBnB so I expect the relationship to be a slightly positive correlation.
describe_by_category(latest_listings_below_500, "price", "host_response_time", transpose=True)
host_response_time N/A a few days or more within a day \ count 2641.000000 182.000000 641.000000 mean 137.946611 124.197802 134.978159 std 94.925205 82.010068 81.423830 min 0.000000 25.000000 10.000000 25% 75.000000 65.000000 77.000000 50% 105.000000 100.000000 110.000000 75% 174.000000 150.000000 170.000000 max 499.000000 485.000000 499.000000 host_response_time within a few hours within an hour count 1047.000000 5344.000000 mean 117.108883 162.850299 std 82.213391 88.979790 min 22.000000 0.000000 25% 57.000000 95.000000 50% 95.000000 145.000000 75% 150.000000 210.000000 max 492.000000 499.000000
grouped = latest_listings_below_500.groupby("host_response_time")
figure = plt.figure(figsize=(20, 12))
axes = figure.add_subplot(2, 3, 1)
axes.hist(grouped["price"].get_group("N/A"),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.025))
axes.set_title("N/A price distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(2, 3, 2)
axes.hist(grouped["price"].get_group("a few days or more"),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.025))
axes.set_title("A few days or more price distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(2, 3, 3)
axes.hist(grouped["price"].get_group("within a day"),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.025))
axes.set_title("within a day price distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(2, 3, 4)
axes.hist(grouped["price"].get_group("within a few hours"),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.025))
axes.set_title("within a few hours price distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(2, 3, 5)
axes.hist(grouped["price"].get_group("within an hour"),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.025))
axes.set_title("within an hour price distribution")
axes.set_xlabel("price")
plt.show()
plt.close()
Generally, the faster a host responds, the most expensive the AirBnB is nd the overall distribution of price is the same shape generally across all catogories.
host_response_rate vs price ¶Similar to host response time, I expect host response rate has a positive correlation with rating which in turn gives the host leverage to raise the price of their listing.
latest_listings_below_500['host_response_rate'] = latest_listings_below_500['host_response_rate'].replace('N/A',np.NaN)
latest_listings_below_500['host_response_rate'] = latest_listings_below_500['host_response_rate'].replace('%', '', regex=True).astype(float)
latest_listings_below_500['host_response_rate'].isnull().values.any()
True
latest_listings_non_null_host_response_rate_below_500 = latest_listings_below_500[latest_listings_below_500["host_response_rate"].notnull()]
correlation(latest_listings_non_null_host_response_rate_below_500, "price", "host_response_rate")
Correlation coefficients: r = 0.07771063276548656 (very weak) rho = 0.11987951077842735 (very weak)
The correlation is very weak but positive overall
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter( latest_listings_non_null_host_response_rate_below_500.host_response_rate, latest_listings_non_null_host_response_rate_below_500.price, marker="o", color="darkslategray")
axes.set_ylabel("price")
axes.set_xlabel("host response rate")
axes.set_title("Scatter Plot of price vs. Host Reponse Rate")
plt.show()
plt.close()
The scatterplot shows this slight positive correlation. The handful of 0% response rate hosts might be skewing this distribution.
host_acceptance_rate vs price ¶Similar to host response time and rate, I expect host acceptance rate has a positive correlation with rating which in turn gives the host leverage to raise the price of their listing.
latest_listings_below_500['host_acceptance_rate'] = latest_listings_below_500['host_acceptance_rate'].replace('N/A',np.NaN)
latest_listings_below_500['host_acceptance_rate'] = latest_listings_below_500['host_acceptance_rate'].replace('%', '', regex=True).astype(float)
latest_listings_below_500['host_response_rate'].isnull().values.any()
True
latest_listings_non_null_host_acceptance_rate_below_500 = latest_listings_below_500[latest_listings_below_500["host_acceptance_rate"].notnull()]
correlation(latest_listings_non_null_host_acceptance_rate_below_500, "price", "host_acceptance_rate")
Correlation coefficients: r = 0.14181600177900086 (very weak) rho = 0.07371328158619253 (very weak)
The correlation is very weak but overall positive
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter( latest_listings_non_null_host_acceptance_rate_below_500.host_acceptance_rate, latest_listings_non_null_host_acceptance_rate_below_500.price, marker="o", color="darkslategray")
axes.set_ylabel("price")
axes.set_xlabel("host acceptance rate")
axes.set_title("Scatter Plot of price vs. Host Acceptance Rate")
plt.show()
plt.close()
The scatterplot shows this slight positive correlation. The handful of 0% response rate hosts might be skewing this distribution.
host_is_superhost vs price ¶A host being a superhost requires that they have many positive traits. This would certainly give the host leverage to raise the price of their AirBnB. I expect there to be a weak positive correlation between the host being a superhost and price.
latest_listings_below_500['host_is_superhost'].isnull().values.any()
True
latest_listings_non_null_host_is_superhost_below_500 = latest_listings_below_500[latest_listings_below_500["host_is_superhost"].notnull()]
describe_by_category(latest_listings_below_500, "price", "host_is_superhost", transpose=True)
host_is_superhost False True count 7509.000000 2499.000000 mean 149.125982 146.375350 std 92.025995 84.974677 min 0.000000 28.000000 25% 80.000000 89.000000 50% 125.000000 124.000000 75% 200.000000 175.000000 max 499.000000 499.000000
There actually appears to be a very very weak negative correlation. The average price of listings hosted by superhosts is slightly less than the average price of listings hosted by non superhosts.
grouped = latest_listings_below_500.groupby("host_is_superhost")
figure = plt.figure(figsize=(20, 6))
axes = figure.add_subplot(1,2, 1)
axes.hist(grouped["price"].get_group(False),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.01))
axes.set_title("is NOT Superhost distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(1,2, 2)
axes.hist(grouped["price"].get_group(True),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.01))
axes.set_title("Is Superhost price distribution")
axes.set_xlabel("price")
plt.show()
plt.close()
The overall distribution across the two categories is comparable.
host_listings_count vs price ¶Host listings count has some potential plausible interactions with listing price. For one, hosts with multiple listings might be more real-estate savvy and know how to prep a home that is worth a higher nightly cost. Or these hosts might only be able to afford multiple listings because they are all cheap real-estate which would mean these two variables have a negative correlation.
latest_listings_below_500['host_listings_count'].isnull().values.any()
True
latest_listings_non_null_host_listings_count_below_500 = latest_listings_below_500[latest_listings_below_500["host_listings_count"].notnull()]
correlation(latest_listings_non_null_host_listings_count_below_500 , "price", "host_listings_count")
Correlation coefficients: r = 0.14389749392647586 (very weak) rho = 0.14816794097130107 (very weak)
There is a positive correlation between these two variables. Hosts with more listings are more likely to have pricer AirBnBs.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter( latest_listings_non_null_host_listings_count_below_500.host_listings_count, latest_listings_non_null_host_listings_count_below_500.price, marker="o", color="darkslategray")
axes.set_ylabel("price")
axes.set_xlabel("host listings count")
axes.set_title("Scatter Plot of price vs. Host Listings Count")
plt.show()
plt.close()
This scatterplot is a bit hard to read because of the outlying host counts at 4000+ listings. We can't really see a positive correlation. Let's zoom in to hosts with 0-20 listings.
latest_listings_below_500['host_listings_count_under_20'] = latest_listings_below_500.host_listings_count[latest_listings_below_500.host_listings_count < 20]
latest_listings_non_null_host_listings_under_20_count_below_500 = latest_listings_below_500[latest_listings_below_500["host_listings_count_under_20"].notnull()]
correlation(latest_listings_non_null_host_listings_under_20_count_below_500 , "price", "host_listings_count_under_20")
Correlation coefficients: r = 0.03678169211730996 (very weak) rho = -0.010678801308667132 (very weak)
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter( latest_listings_non_null_host_listings_under_20_count_below_500.host_listings_count_under_20, latest_listings_non_null_host_listings_under_20_count_below_500.price, marker="o", color="darkslategray")
axes.set_ylabel("price")
axes.set_xlabel("host listings count")
axes.set_title("Scatter Plot of price vs. Host Listings Count (Under 20)")
plt.show()
plt.close()
The scarcity of hosts with more than 5 listings is making this distribution almost appear to be trending downward. However, knowing that most data points exist between 0 and 2.5 listings we see that there is a slight positive correlation.
host_total_listings_count vs price ¶UNUSED
host_verifications vs price ¶In a future iteration, this information could be pulled out in a useful way and explored.
latest_listings.host_verifications.value_counts(normalize=True)[:10]
['email', 'phone'] 0.550095 ['email', 'phone', 'work_email'] 0.136080 ['phone'] 0.051705 ['email', 'phone', 'reviews', 'kba'] 0.024716 ['email', 'phone', 'reviews', 'jumio', 'government_id'] 0.016004 None 0.015341 ['email', 'phone', 'reviews'] 0.014583 ['email', 'phone', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual'] 0.010511 ['email', 'phone', 'reviews', 'jumio', 'offline_government_id', 'government_id'] 0.010133 ['email', 'phone', 'offline_government_id', 'selfie', 'government_id', 'identity_manual'] 0.009280 Name: host_verifications, dtype: float64
host_has_profile_pic vs price ¶Hosts are required to have a profile picture. The hosts that don't likely have junk and/or null data. Regardless, hosts with profile pictures could be interpreted as more trustable than hosts without because of the personal element and identity verification element thus I expect a slight positive correlation.
describe_by_category(latest_listings_below_500, "price", "host_has_profile_pic", transpose=True)
host_has_profile_pic False True count 69.000000 9786.000000 mean 142.130435 148.837114 std 97.503380 90.770977 min 0.000000 0.000000 25% 80.000000 82.000000 50% 107.000000 125.000000 75% 161.000000 196.000000 max 499.000000 499.000000
There is a ~6 USD difference in mean listing price favoring hosts with a profile picture.
grouped = latest_listings_below_500.groupby("host_has_profile_pic")
figure = plt.figure(figsize=(20, 6))
axes = figure.add_subplot(1,2, 1)
axes.hist(grouped["price"].get_group(False),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.01))
axes.set_title("DOES NOT Have Profile Picture - Price distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(1,2, 2)
axes.hist(grouped["price"].get_group(True),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.01))
axes.set_title("HAS Profile Picture - Price distribution")
axes.set_xlabel("price")
plt.show()
plt.close()
The two groups have similar distributions. There are much fewer host without a profile picture so the distribution looks a bit rougher.
host_identity_verified vs price ¶Hosts having their identity verified could be perceived as more trustable and therefore customers might feel more inclined to give them money and they have leverage to increase the price of their listings. For that reason, I expect the correlation to be weak and positive.
describe_by_category(latest_listings_below_500, "price", "host_identity_verified", transpose=True)
host_identity_verified False True count 1577.000000 8278.000000 mean 141.034876 150.267577 std 91.445010 90.626311 min 0.000000 0.000000 25% 78.000000 84.000000 50% 111.000000 126.000000 75% 179.000000 198.000000 max 499.000000 499.000000
As expected, the distribution is weak but positive.
grouped = latest_listings_below_500.groupby("host_identity_verified")
figure = plt.figure(figsize=(20, 6))
axes = figure.add_subplot(1,2, 1)
axes.hist(grouped["price"].get_group(False),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.01))
axes.set_title("DOES NOT Has Host Identity Verification - Price distribution")
axes.set_xlabel("price")
axes = figure.add_subplot(1,2, 2)
axes.hist(grouped["price"].get_group(True),color="darkslategray",density=True, range=(0,500))
axes.set_ylim((0,0.01))
axes.set_title("HAS Host Identity Verification - Price distribution")
axes.set_xlabel("price")
plt.show()
plt.close()
The distributions for both are unimodal and skewed to the right - both hosts with and without identity verification look very similar in distribution of price.
Many of the host information variables have little to no correlation or predictive power with price. A few have very weak to weak correlations that could be potentially useful like
host_has_profile_picturehost_acceptance_ratehost_identity_verifiedhost_listings_countAnd the correlations that we saw with these variables made sense with what we expected from domain knowledge. However, among the above variables with non-negligible correlations, only two have a tolerable either (1) split in category or (2) a reasonable distribution without too many outliers and those are
host_listings_counthost_identity_verifiedIn this section, we'll look at the relationship between the room descriptors and the target variable, price.
We expect an "entire home/apt" to price higher than a "private" room or "hotel room", which would price higher than a "shared room."
max_plot_price = 26000
fig = plt.figure(figsize=(11,15))
ax1 = fig.add_subplot(1,1,1)
sns.boxplot(x = 'room_type', y='price', data = listings_df)
ax1.set_yticks(range(0,max_plot_price,500))
plt.show()
The extreme values, especially the $25,000 listing, makes it hard to see the distribution of prices per room type.
These extreme values are real prices, and I am hesitant to throw them out.
We'll zoom into the majority of data points below.
def plot_box_cat(plot_df, x_name, y_name, y_lim, y_step):
fig = plt.figure(figsize=(9,12))
ax1 = fig.add_subplot(1,1,1)
sns.boxplot(x = x_name, y=y_name, data = plot_df)
ax1.set_yticks(range(0,y_lim,y_step))
ax1.set(ylim=[0,y_lim])
plt.show()
max_plot_price = 500
#plot_df = listings_df[listings_df.price < max_plot_price]
plot_df = listings_df[listings_df.price > 0]
plot_box_cat(plot_df, 'room_type', 'price', max_plot_price, 100)
As expected, the Entire home/apt is the most pricey. This is followed by private room. The shared room median is above the hotel room median. I expected the hotel room to follow the private room trend, but it did not. With 28 of the total ~10,500 listings being hotel rooms, there may be some issues here. todo explain
We expect the price to increase as the maximum accomodation for a listing increases.
fig = plt.figure(figsize=(11,22))
ax1 = fig.add_subplot(1,1,1)
sns.boxplot(x = 'accommodates', y='price', data = listings_df)
ax1.set_yticks(range(0,26000,500))
plt.show()
Once again, the extreme values make it difficult to see the majority of the distributions. We'll limit the view in the next plot:
max_plot_price = 1500
#plot_df = listings_df[listings_df.price < max_plot_price]
plot_df = listings_df[listings_df.price > 0]
plot_box_cat(plot_df, 'accommodates', 'price', max_plot_price, 100)
The mean price tends to increase as the maximum accommodations increase. There is a wider distribution of prices as maximum accomodations increases as well.
There may be another variable that can explain the distribution.
correlation(plot_df, "price", "accommodates")
Correlation coefficients: r = 0.23975703067904652 (weak) rho = 0.5594564264798472 (moderate)
The Pearson's correlation coefficient is .24 (weak), and the spearman's correlation coefficient is .56 (moderate). There is an increasing monotonic relationship.
As with accomodations, we expect the price to increase as the number of bedrooms increases.
fig = plt.figure(figsize=(9,15))
ax1 = fig.add_subplot(1,1,1)
sns.boxplot(x = 'bedrooms', y='price', data = listings_df)
ax1.set_yticks(range(0,26500,500))
plt.show()
We can generally see an increase in the overall distribution of prices as we go up in bedrooms until about 8 bedrooms. There are more outliers from 1-4 bedrooms.
It's hard to see the majority if the distributions, so we'll zoom in:
max_plot_price = 2000
#plot_df = listings_df[listings_df.price < max_plot_price]
plot_df = listings_df[listings_df.price > 0]
plot_box_cat(plot_df, 'bedrooms', 'price', max_plot_price, 100)
Here we can see rising averages and rising distributions as bedrooms increases.
plot_df = listings_df[listings_df[['price','bedrooms']].notnull().all(1)]
correlation(plot_df, "price", "bedrooms")
plot_df.shape
Correlation coefficients: r = 0.3480348197650736 (low) rho = 0.5373803796400258 (moderate)
(9600, 9)
Here the pearson's correlation coefficient is .35 (low), and the spearman's correlation coefficient is .54 (moderate). This may be a good variable to use in modeling.
Note that there were around 800 rows that did not have data for number of bedrooms.
As with accomodations, we expect the price to increase as the number of beds increases.
fig = plt.figure(figsize=(9,15))
ax1 = fig.add_subplot(1,1,1)
sns.boxplot(x = 'beds', y='price', data = listings_df)
ax1.set_yticks(range(0,26500,500))
plt.show()
Once again, this is difficult to make out due to the outliers compressing the other data in the plot. We'll zoom in further.
max_plot_price = 2000
plot_df = listings_df[listings_df.price > 0]
plot_box_cat(plot_df, 'beds', 'price', max_plot_price, 100)
This plot looks similar to the number of bedrooms.
plot_df = listings_df[listings_df[['price','beds']].notnull().all(1)]
correlation(plot_df, "price", "beds")
plot_df.shape
Correlation coefficients: r = 0.19648876526139203 (weak) rho = 0.4389629698933788 (low)
(10291, 9)
This shows a weak Pearson's correlation (.2), and a low (.44) Spearman's correlation. The number of bedrooms had higher coefficients for both of these.
The numeric values are all related: accommodates, bathrooms, bedrooms, beds. The more people you can fit in, the higher the price is generally. The function criteria for "accommodates" isn't listed, but it should be a function of the number of bedrooms and bathrooms. Overall, as these increase, so does price.
There were a number of rows that should be thrown away as errors because they cannot be true, and we have no way to correct them in the data. For the following criteria, we will drop the data:
The price comparison with review scores accuracy. We have have some null values in our data that we will remove and only look at non-null reviews.
latest_listings_review_scores_accuracy = latest_listings[latest_listings[["review_scores_rating"]].notnull().all(1)]
Let's look at the correlation coefficient.
print("r = ", stats.pearsonr(latest_listings_review_scores_accuracy.price, latest_listings_review_scores_accuracy.review_scores_rating)[0])
print("rho = ", stats.spearmanr(latest_listings_review_scores_accuracy.price, latest_listings_review_scores_accuracy.review_scores_rating)[0])
r = -0.0105822256939008 rho = 0.07745461501588657
Pearson's correlation coefficient, $r$, is only -0.01 which indicates a weak inverse relationship. Spearman's is interesting at 0.07 which might indicate that there is least a weak monotonic relationship. This is interesting because if means that as scores go up price goes down. I'm not sure why that would be the case.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_review_scores_accuracy.review_scores_rating, latest_listings_review_scores_accuracy.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Review Scores Accuracy")
axes.set_title("Scatter Plot of Price vs. Accuracy Review Scores")
plt.show()
plt.close()
We can see a slight pattern as review score goes up we see price going up, but it mostly it looks very uniform.
The price comparison with review scores cleanliness. We have have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_review_scores_cleanliness = latest_listings[latest_listings[["review_scores_cleanliness"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_review_scores_cleanliness.price, latest_listings_review_scores_cleanliness.review_scores_cleanliness)[0])
print("rho = ", stats.spearmanr(latest_listings_review_scores_cleanliness.price, latest_listings_review_scores_cleanliness.review_scores_cleanliness)[0])
r = 0.025898444513533313 rho = 0.10561959800641402
Pearson's correlation coefficient, 𝑟 , is only 0.02 which indicates a weak relationship. Spearman's is interesting at 0.105 which might indicate a monotonic relationship. This is interesting because it means that as cleanliness scores go up price goes up.
I think that does make sense, as cleaning fees have become very pricey in AirBnB and has led to a lot of backlash for the service. It would seem plausible that a listing that is clean enough to earn a high score charges more for due to the additional costs to clean the listing to such a high standard.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_review_scores_cleanliness.review_scores_cleanliness, latest_listings_review_scores_cleanliness.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Review Scores cleanliness")
axes.set_title("Scatter Plot of Price vs. cleanliness Review Scores")
plt.show()
plt.close()
These are all very similar graphics due to the way the ratings are all very similar as we saw from EDA.
The price comparison to checkin reviews. We have have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_review_scores_checkin = latest_listings[latest_listings[["review_scores_checkin"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_review_scores_checkin.price, latest_listings_review_scores_checkin.review_scores_checkin)[0])
print("rho = ", stats.spearmanr(latest_listings_review_scores_checkin.price, latest_listings_review_scores_checkin.review_scores_checkin)[0])
r = 0.007325971384927467 rho = 0.03134393309356987
Pearson's correlation coefficient, 𝑟 , is only 0.007 which indicates an extremely weak relationship. Spearman's is slightly higher at 0.031 which might indicate a weak monotonic relationship. I think checkin factors into price but it is very small, because most checkin processes are not that difficult. Like at a hotel, checkin process tends to be a very standard and unexciting process.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_review_scores_checkin.review_scores_checkin, latest_listings_review_scores_checkin.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Review Scores Checkin")
axes.set_title("Scatter Plot of Price vs. Checkin Review Scores")
plt.show()
plt.close()
The pattern is very similar to the other variables. The effect is so small we don't observe it from the data.
The price comparison to communication reviews. We have have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_review_scores_communication = latest_listings[latest_listings[["review_scores_communication"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_review_scores_communication.price, latest_listings_review_scores_communication.review_scores_communication)[0])
print("rho = ", stats.spearmanr(latest_listings_review_scores_communication.price, latest_listings_review_scores_communication.review_scores_communication)[0])
r = 0.0031412207072375217 rho = 0.015591363504826119
Pearson's correlation coefficient, 𝑟 , is only 0.0031 which indicates an extremely weak relationship. Spearman's is interesting at 0.01 which might indicate a very weak monotonic relationship. This score impacts prices a little more than checkin. My assumption is that communication is important, but much of that interaction happens via existing documents that a host provides in the home and through the app.
The recommendation from AirBnB is that hosts provide most relevant information in document form such as WiFi passwords, etc...This reduces the amount of actual communication with the host, thus leading to only brief interactions.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_review_scores_communication.review_scores_communication, latest_listings_review_scores_communication.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Review Scores Communication")
axes.set_title("Scatter Plot of Price vs. Communication Review Scores")
plt.show()
plt.close()
This graph shows a inclined slope as scores increase from 4 - 5, we see our prices increase as well, but only slightly for a small grouping of listings.
The price comparison to location reviews. This is an important variable for our analysis because we predict that it has the most impact on price. Our assumption is that a lot of different qualities of an AirBnB is captured in the location. We have have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_review_scores_location = latest_listings[latest_listings[["review_scores_location"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_review_scores_location.price, latest_listings_review_scores_location.review_scores_location)[0])
print("rho = ", stats.spearmanr(latest_listings_review_scores_location.price, latest_listings_review_scores_location.review_scores_location)[0])
r = 0.056984325513637246 rho = 0.1973334679991015
Pearson's correlation coefficient, 𝑟 , is 0.05 which indicates a weak relationship, however it is higher than some of our other review variables. Spearman's is interesting at 0.19 which indicates a moderate monotonic relationship. I think this confirms our assumption that location is very important in the pricing strategy of an AirBnB and thus the location review score captures that more clearly.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_review_scores_location.review_scores_location, latest_listings_review_scores_location.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Review Scores Communication")
axes.set_title("Scatter Plot of Price vs. Location Review Scores")
plt.show()
plt.close()
This graph shows a inclined slope as scores increase from 4 - 5, we see our prices increase as well, but only slightly for a small grouping of listings.
The price comparison to value reviews. We have have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_review_scores_value = latest_listings[latest_listings[["review_scores_value"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_review_scores_value.price, latest_listings_review_scores_value.review_scores_value)[0])
print("rho = ", stats.spearmanr(latest_listings_review_scores_value.price, latest_listings_review_scores_value.review_scores_value)[0])
r = -0.00832520722754418 rho = -0.02714469810249014
Pearson's correlation coefficient, 𝑟 , is only -0.008 which indicates a weak inverse relationship. Spearman's is interesting at -0.027 which indicates a very weak inverse monotonic relationship. This is interesting because it means that as the value score decrease the price increase. I think makes sense from a 'value' perspective, most guests would consider a range of prices that undercuts hotels or other lodging options, thus that would indicate higher value and a lower price.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_review_scores_value.review_scores_value, latest_listings_review_scores_value.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Review Scores Value")
axes.set_title("Scatter Plot of Price vs. Value Review Scores")
plt.show()
plt.close()
The price comparison to number of reviews. We have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_number_of_reviews = latest_listings[latest_listings[["number_of_reviews"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_number_of_reviews.price, latest_listings_number_of_reviews.number_of_reviews)[0])
print("rho = ", stats.spearmanr(latest_listings_number_of_reviews.price, latest_listings_number_of_reviews.number_of_reviews)[0])
r = -0.05603365411307578 rho = -0.15113808981420673
Pearson's correlation coefficient, 𝑟 , is only -0.056 which indicates a weak inverse relationship. Spearman's is interesting at -0.15 which indicates a very weak inverse monotonic relationship. It's strange that as we have more reviews price goes down. I wonder if that indicates that as a listing gains more reviews and thus more guest, the revenue generated by the host allows them to lower their prices.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_number_of_reviews.number_of_reviews, latest_listings_number_of_reviews.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Number of Review Scores")
axes.set_title("Scatter Plot of Price vs. Number of Review Scores")
plt.show()
plt.close()
This graph showcases very clearly what we got above with the coefficients. There is a decline in price as the number of review scores increase.
The price comparison to number of reviews in the last 30 days. We have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_number_of_reviews_l30d = latest_listings[latest_listings[["number_of_reviews"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_number_of_reviews_l30d.price, latest_listings_number_of_reviews_l30d.number_of_reviews_l30d)[0])
print("rho = ", stats.spearmanr(latest_listings_number_of_reviews_l30d.price, latest_listings_number_of_reviews_l30d.number_of_reviews_l30d)[0])
r = -0.038392251877214 rho = -0.024724854697841612
Pearson's correlation coefficient, 𝑟 , is only -0.038 which indicates a weak inverse relationship. Spearman's is interesting at -0.025 which indicates there is a weak inverse monotonic relationship.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_number_of_reviews_l30d.number_of_reviews_l30d, latest_listings_number_of_reviews_l30d.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Number of Review Scores Last 30 Days")
axes.set_title("Scatter Plot of Price vs. Number of Reviews Last 30 days Scores")
plt.show()
plt.close()
This graph showcases very clearly what we got above with the coefficients. There is a decline in price as the number of review scores in the last 30 days increase. It does track with our analysis of the total reviews to price that we previously saw.
The price comparison to number of reviews per month. We have some null values in our data that we will remove and only look at non-null reviews.
Let's look at the correlation coefficient.
latest_listings_reviews_per_month = latest_listings[latest_listings[["reviews_per_month"]].notnull().all(1)]
print("r = ", stats.pearsonr(latest_listings_reviews_per_month.price, latest_listings_reviews_per_month.reviews_per_month)[0])
print("rho = ", stats.spearmanr(latest_listings_reviews_per_month.price, latest_listings_reviews_per_month.reviews_per_month)[0])
r = -0.01878010369013164 rho = 0.0951775273308491
Pearson's correlation coefficient, 𝑟 , is only -0.019 which indicates an extremely weak inverse relationship. Spearman's is interesting at 0.096 which indicates a weak monotonic relationship. Similar to what we saw in other nominal review totals.
figure = plt.figure(figsize=(10, 10))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(latest_listings_reviews_per_month.reviews_per_month, latest_listings_reviews_per_month.price, marker="o", color="darkslategray")
axes.set_ylabel("Price")
axes.set_xlabel("Number of Reviews Per Month Value")
axes.set_title("Scatter Plot of Price vs. Number of Reviews Per Month Scores")
plt.show()
plt.close()
We expect that areas with higher crime will be less desirable, leading to a lower price. There is the possibility that areas more heavily affected by crime will not be represented in the data set, resulting in a survivorship bias.
total crimes vs price ¶The first comparison of crime and price will be the total count of crimes per neighborhood to price.
rel = con.execute("select * from crimes")
crimes = rel.df()
crimes
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | METHOD | LATITUDE | NEIGHBORHOOD_NAME | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | others | 38.897331 | Union Station, Stanton Park, Kingman Park |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | others | 38.881269 | Capitol Hill, Lincoln Park |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | gun | 38.858661 | Fairfax Village, Naylor Gardens, Hillcrest, Su... |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | others | 38.831632 | Congress Heights, Bellevue, Washington Highlands |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | others | 38.922290 | Columbia Heights, Mt. Pleasant, Pleasant Plain... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27606 | 27606 | cluster 1 | property | -77.042324 | 7/21/2022, 7:00:00 PM | theft/other | midnight | property|theft/other | 2022-07-21 | theft/other | others | 38.926303 | Kalorama Heights, Adams Morgan, Lanier Heights |
| 27607 | 27607 | cluster 1 | property | -77.042369 | 8/12/2022, 6:30:00 AM | theft/other | midnight | property|theft/other | 2022-08-11 | theft/other | others | 38.924303 | Kalorama Heights, Adams Morgan, Lanier Heights |
| 27608 | 27608 | cluster 26 | property | -76.994963 | 8/12/2022, 9:00:00 AM | theft f/auto | midnight | property|theft f/auto | 2022-08-12 | theft f/auto | others | 38.891322 | Capitol Hill, Lincoln Park |
| 27609 | 27609 | cluster 11 | property | -77.081165 | 5/31/2022, 7:00:00 AM | theft f/auto | day | property|theft f/auto | 2022-05-26 | theft f/auto | others | 38.955947 | Friendship Heights, American University Park, ... |
| 27610 | 27610 | cluster 25 | property | -77.001314 | 5/31/2022, 11:37:00 AM | theft f/auto | day | property|theft f/auto | 2022-05-27 | theft f/auto | others | 38.898907 | Union Station, Stanton Park, Kingman Park |
27611 rows × 13 columns
A new column is added that has the count of crimes for each neighborhood.
crimes["total_crimes"] = crimes.groupby("NEIGHBORHOOD_NAME")["NEIGHBORHOOD_NAME"].transform('count')
crimes
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | METHOD | LATITUDE | NEIGHBORHOOD_NAME | total_crimes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | others | 38.897331 | Union Station, Stanton Park, Kingman Park | 2070 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | others | 38.881269 | Capitol Hill, Lincoln Park | 1095 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | gun | 38.858661 | Fairfax Village, Naylor Gardens, Hillcrest, Su... | 315 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | others | 38.831632 | Congress Heights, Bellevue, Washington Highlands | 990 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | others | 38.922290 | Columbia Heights, Mt. Pleasant, Pleasant Plain... | 1788 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27606 | 27606 | cluster 1 | property | -77.042324 | 7/21/2022, 7:00:00 PM | theft/other | midnight | property|theft/other | 2022-07-21 | theft/other | others | 38.926303 | Kalorama Heights, Adams Morgan, Lanier Heights | 729 |
| 27607 | 27607 | cluster 1 | property | -77.042369 | 8/12/2022, 6:30:00 AM | theft/other | midnight | property|theft/other | 2022-08-11 | theft/other | others | 38.924303 | Kalorama Heights, Adams Morgan, Lanier Heights | 729 |
| 27608 | 27608 | cluster 26 | property | -76.994963 | 8/12/2022, 9:00:00 AM | theft f/auto | midnight | property|theft f/auto | 2022-08-12 | theft f/auto | others | 38.891322 | Capitol Hill, Lincoln Park | 1095 |
| 27609 | 27609 | cluster 11 | property | -77.081165 | 5/31/2022, 7:00:00 AM | theft f/auto | day | property|theft f/auto | 2022-05-26 | theft f/auto | others | 38.955947 | Friendship Heights, American University Park, ... | 365 |
| 27610 | 27610 | cluster 25 | property | -77.001314 | 5/31/2022, 11:37:00 AM | theft f/auto | day | property|theft f/auto | 2022-05-27 | theft f/auto | others | 38.898907 | Union Station, Stanton Park, Kingman Park | 2070 |
27611 rows × 14 columns
Next we want to be able to compare prices to crime so we have to add prices to the same dataframe. As the listing ids aren't going to be useful for the crimes database, the average for each neighborhood will be used. Any rows with NAs are dropped.
prices = latest_listings[["id", "price", "neighbourhood_cleansed"]]
prices = prices.dropna()
prices
| id | price | neighbourhood_cleansed | |
|---|---|---|---|
| 0 | 3686 | 67.0 | Historic Anacostia |
| 1 | 841260 | 39.0 | Brightwood Park, Crestwood, Petworth |
| 2 | 3943 | 75.0 | Edgewood, Bloomingdale, Truxton Circle, Eckington |
| 3 | 842418 | 250.0 | Downtown, Chinatown, Penn Quarters, Mount Vern... |
| 4 | 860336 | 113.0 | Dupont Circle, Connecticut Avenue/K Street |
| ... | ... | ... | ... |
| 10555 | 574415195460129986 | 235.0 | Shaw, Logan Circle |
| 10556 | 582832241528964793 | 110.0 | Union Station, Stanton Park, Kingman Park |
| 10557 | 552672006345399432 | 207.0 | Colonial Village, Shepherd Park, North Portal ... |
| 10558 | 48252468 | 192.0 | Colonial Village, Shepherd Park, North Portal ... |
| 10559 | 578751728298235973 | 237.0 | Shaw, Logan Circle |
10560 rows × 3 columns
A new column is added:
prices["avg_price"] = prices.groupby("neighbourhood_cleansed")["price"].transform("mean")
prices
| id | price | neighbourhood_cleansed | avg_price | |
|---|---|---|---|---|
| 0 | 3686 | 67.0 | Historic Anacostia | 151.233766 |
| 1 | 841260 | 39.0 | Brightwood Park, Crestwood, Petworth | 142.733591 |
| 2 | 3943 | 75.0 | Edgewood, Bloomingdale, Truxton Circle, Eckington | 171.148545 |
| 3 | 842418 | 250.0 | Downtown, Chinatown, Penn Quarters, Mount Vern... | 254.383929 |
| 4 | 860336 | 113.0 | Dupont Circle, Connecticut Avenue/K Street | 262.198238 |
| ... | ... | ... | ... | ... |
| 10555 | 574415195460129986 | 235.0 | Shaw, Logan Circle | 223.815951 |
| 10556 | 582832241528964793 | 110.0 | Union Station, Stanton Park, Kingman Park | 213.681243 |
| 10557 | 552672006345399432 | 207.0 | Colonial Village, Shepherd Park, North Portal ... | 144.356784 |
| 10558 | 48252468 | 192.0 | Colonial Village, Shepherd Park, North Portal ... | 144.356784 |
| 10559 | 578751728298235973 | 237.0 | Shaw, Logan Circle | 223.815951 |
10560 rows × 4 columns
Only using the relevant columns and removing duplicates:
price_neigh = prices[["neighbourhood_cleansed", "avg_price"]]
price_neigh = price_neigh.drop_duplicates()
price_neigh
| neighbourhood_cleansed | avg_price | |
|---|---|---|
| 0 | Historic Anacostia | 151.233766 |
| 1 | Brightwood Park, Crestwood, Petworth | 142.733591 |
| 2 | Edgewood, Bloomingdale, Truxton Circle, Eckington | 171.148545 |
| 3 | Downtown, Chinatown, Penn Quarters, Mount Vern... | 254.383929 |
| 4 | Dupont Circle, Connecticut Avenue/K Street | 262.198238 |
| 5 | Howard University, Le Droit Park, Cardozo/Shaw | 203.638961 |
| 6 | Douglas, Shipley Terrace | 161.063830 |
| 7 | Capitol Hill, Lincoln Park | 230.157343 |
| 8 | Lamont Riggs, Queens Chapel, Fort Totten, Plea... | 132.696970 |
| 9 | Friendship Heights, American University Park, ... | 157.533333 |
| 11 | Kalorama Heights, Adams Morgan, Lanier Heights | 157.829480 |
| 12 | Columbia Heights, Mt. Pleasant, Pleasant Plain... | 155.922889 |
| 14 | Ivy City, Arboretum, Trinidad, Carver Langston | 166.836812 |
| 19 | Shaw, Logan Circle | 223.815951 |
| 23 | West End, Foggy Bottom, GWU | 273.766871 |
| 25 | Brookland, Brentwood, Langdon | 130.921053 |
| 27 | Cathedral Heights, McLean Gardens, Glover Park | 370.658915 |
| 28 | Takoma, Brightwood, Manor Park | 161.151163 |
| 38 | Union Station, Stanton Park, Kingman Park | 213.681243 |
| 46 | Congress Heights, Bellevue, Washington Highlands | 106.824561 |
| 51 | Colonial Village, Shepherd Park, North Portal ... | 144.356784 |
| 59 | Georgetown, Burleith/Hillandale | 249.234763 |
| 61 | Southwest Employment Area, Southwest/Waterfron... | 275.127451 |
| 64 | Spring Valley, Palisades, Wesley Heights, Foxh... | 248.885463 |
| 71 | Hawthorne, Barnaby Woods, Chevy Chase | 246.425743 |
| 78 | Capitol View, Marshall Heights, Benning Heights | 165.534722 |
| 90 | North Michigan Park, Michigan Park, University... | 107.840000 |
| 116 | Cleveland Park, Woodley Park, Massachusetts Av... | 201.939394 |
| 190 | Deanwood, Burrville, Grant Park, Lincoln Heigh... | 124.220690 |
| 233 | Fairfax Village, Naylor Gardens, Hillcrest, Su... | 123.060606 |
| 326 | Eastland Gardens, Kenilworth | 96.105263 |
| 331 | River Terrace, Benning, Greenway, Dupont Park | 196.857143 |
| 338 | North Cleveland Park, Forest Hills, Van Ness | 167.000000 |
| 441 | Near Southeast, Navy Yard | 195.355556 |
| 459 | Twining, Fairlawn, Randle Highlands, Penn Bran... | 166.392308 |
| 583 | Sheridan, Barry Farm, Buena Vista | 140.472727 |
| 595 | Mayfair, Hillbrook, Mahaning Heights | 109.968750 |
| 637 | Woodland/Fort Stanton, Garfield Heights, Knox ... | 148.666667 |
| 638 | Woodridge, Fort Lincoln, Gateway | 141.282353 |
Adding the price column to the crimes dataframe:
crimes = crimes.join(price_neigh.set_index("neighbourhood_cleansed"), on="NEIGHBORHOOD_NAME")
crimes = crimes.dropna()
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | METHOD | LATITUDE | NEIGHBORHOOD_NAME | total_crimes | avg_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | others | 38.897331 | Union Station, Stanton Park, Kingman Park | 2070 | 213.681243 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | others | 38.881269 | Capitol Hill, Lincoln Park | 1095 | 230.157343 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | gun | 38.858661 | Fairfax Village, Naylor Gardens, Hillcrest, Su... | 315 | 123.060606 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | others | 38.831632 | Congress Heights, Bellevue, Washington Highlands | 990 | 106.824561 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | others | 38.922290 | Columbia Heights, Mt. Pleasant, Pleasant Plain... | 1788 | 155.922889 |
Now we have average price and total crime counts per neighborhood.
total_crimes = crimes["total_crimes"]
avg_price = crimes["avg_price"]
print("r = ", stats.pearsonr(avg_price, total_crimes)[0])
print("rho = ", stats.spearmanr(avg_price, total_crimes)[0])
r = 0.2741294484862969 rho = 0.33422917462614293
This is not a very strong correlation, though it is positive. It's possible that higher crimes and higher prices are correlated simply because of population density, but we would need more data to determine that relationship.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_crimes, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The slight correlation is visible here, though it is slight.
total types of crimes vs price ¶This is the total of each type of crime versus price.
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | METHOD | LATITUDE | NEIGHBORHOOD_NAME | total_crimes | avg_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | others | 38.897331 | Union Station, Stanton Park, Kingman Park | 2070 | 213.681243 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | others | 38.881269 | Capitol Hill, Lincoln Park | 1095 | 230.157343 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | gun | 38.858661 | Fairfax Village, Naylor Gardens, Hillcrest, Su... | 315 | 123.060606 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | others | 38.831632 | Congress Heights, Bellevue, Washington Highlands | 990 | 106.824561 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | others | 38.922290 | Columbia Heights, Mt. Pleasant, Pleasant Plain... | 1788 | 155.922889 |
We want to look at the each offense.
crimes = pd.concat([crimes, pd.get_dummies(crimes["OFFENSE"], prefix="offense")], axis=1)
Renaming the columns.
crimes = crimes.rename(columns={"offense_assault w/dangerous weapon":"offense_assault_dangerous_weapon", "offense_motor vehicle theft":"offense_motor_vehicle_theft", "offense_sex abuse":"offense_sex_abuse", "offense_theft f/auto":"offense_theft_auto", "offense_theft/other":"offense_theft_other"})
crimes
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | avg_price | offense_arson | offense_assault_dangerous_weapon | offense_burglary | offense_homicide | offense_motor_vehicle_theft | offense_robbery | offense_sex_abuse | offense_theft_auto | offense_theft_other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 213.681243 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 230.157343 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 123.060606 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 106.824561 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 155.922889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27606 | 27606 | cluster 1 | property | -77.042324 | 7/21/2022, 7:00:00 PM | theft/other | midnight | property|theft/other | 2022-07-21 | theft/other | ... | 157.829480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 27607 | 27607 | cluster 1 | property | -77.042369 | 8/12/2022, 6:30:00 AM | theft/other | midnight | property|theft/other | 2022-08-11 | theft/other | ... | 157.829480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 27608 | 27608 | cluster 26 | property | -76.994963 | 8/12/2022, 9:00:00 AM | theft f/auto | midnight | property|theft f/auto | 2022-08-12 | theft f/auto | ... | 230.157343 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 27609 | 27609 | cluster 11 | property | -77.081165 | 5/31/2022, 7:00:00 AM | theft f/auto | day | property|theft f/auto | 2022-05-26 | theft f/auto | ... | 157.533333 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 27610 | 27610 | cluster 25 | property | -77.001314 | 5/31/2022, 11:37:00 AM | theft f/auto | day | property|theft f/auto | 2022-05-27 | theft f/auto | ... | 213.681243 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
25305 rows × 24 columns
crimes["OFFENSE"].unique()
array(['homicide', 'robbery', 'assault w/dangerous weapon',
'theft f/auto', 'theft/other', 'motor vehicle theft', 'burglary',
'sex abuse', 'arson'], dtype=object)
Now taking the average for each neighborhood per offense.
crimes["total_offense_arson"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_arson"].transform("sum")
crimes["total_offense_assault_dangerous_weapon"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_assault_dangerous_weapon"].transform("sum")
crimes["total_offense_burglary"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_burglary"].transform("sum")
crimes["total_offense_homicide"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_homicide"].transform("sum")
crimes["total_offense_motor_vehicle_theft"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_motor_vehicle_theft"].transform("sum")
crimes["total_offense_robbery"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_robbery"].transform("sum")
crimes["total_offense_sex_abuse"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_sex_abuse"].transform("sum")
crimes["total_offense_theft_auto"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_theft_auto"].transform("sum")
crimes["total_offense_theft_other"] = crimes.groupby("NEIGHBORHOOD_NAME")["offense_theft_other"].transform("sum")
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | offense_theft_other | total_offense_arson | total_offense_assault_dangerous_weapon | total_offense_burglary | total_offense_homicide | total_offense_motor_vehicle_theft | total_offense_robbery | total_offense_sex_abuse | total_offense_theft_auto | total_offense_theft_other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 0 | 0 | 55 | 78 | 8 | 349.0 | 139 | 12 | 521.0 | 772.0 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 1 | 23 | 36 | 4 | 110.0 | 82 | 3 | 283.0 | 503.0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 0 | 13 | 8 | 3 | 42.0 | 33 | 2 | 100.0 | 85.0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 0 | 1 | 132 | 50 | 27 | 160.0 | 109 | 7 | 135.0 | 213.0 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 0 | 0 | 51 | 51 | 5 | 182.0 | 131 | 6 | 516.0 | 703.0 |
5 rows × 33 columns
offense_sum = crimes["total_offense_arson"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = -0.11866420773467688 rho = -0.0629486848558514
There is a slight negative correlation between arson crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Arson offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There are too few arson offenses to really draw conclusions.
offense_sum = crimes["total_offense_assault_dangerous_weapon"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = -0.3407583655881097 rho = -0.19319826732628786
There is a slight positive correlation between assault with a dangerous weapon crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Assault with dangerous weapon offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There is a slight positive correlation. It's possible that, similar to total crimes, the more populous or dense a neighborhood is, the more crime there is and the higher the prices, but without price depending on crime or vice versa.
offense_sum = crimes["total_offense_burglary"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = 0.0760714915926916 rho = 0.12415044676181282
There is a slight positive correlation between burglary crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Burglary offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There is possibly a very slight positive correlation.
offense_sum = crimes["total_offense_homicide"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = -0.452242606629432 rho = -0.4193329909486742
There is a medium negative correlation between homicide crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Homicide offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There does appear to be a negative correlation between homicide and price.
offense_sum = crimes["total_offense_motor_vehicle_theft"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = 0.08516232954716718 rho = 0.11592847942971346
There is a slight positive correlation between motor vehicle theft crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Arson offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There does not appear to be a strong correlation.
offense_sum = crimes["total_offense_robbery"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = -0.08478055618637566 rho = 0.01861935538151312
There is a slight negative correlation between robbery crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Robbery offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There is not an obvious correlation.
offense_sum = crimes["total_offense_sex_abuse"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = 0.08486037004194204 rho = 0.21899182451401208
There is a slight positive correlation between sex abuse crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Sex abuse offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
Just based on the chart there would appear to be a slight negative correlation, though it is probably that there is not enough data or there is not a correlation.
offense_sum = crimes["total_offense_theft_auto"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = 0.28750469985174165 rho = 0.35221021086374615
There is a slight/medium positive correlation between auto theft crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Auto theft offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There appears to be a slight positive correlation.
offense_sum = crimes["total_offense_theft_other"]
print("r = ", stats.pearsonr(avg_price, offense_sum)[0])
print("rho = ", stats.spearmanr(avg_price, offense_sum)[0])
r = 0.427154432329103 rho = 0.47560561790571704
There is a medium positive correlation between other theft crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(offense_sum, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Other theft offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There does appear to be a slight positive correlation.
Overall there appear to be some crimes that do show a correlation between average price per neighborhood and total offenses per neighborhood. There are some offenses that do not appear to have a correlation. However, we don't know if there are unknown confounding variables, so it may be unhelpful to use these variables in the model.
offense types vs price ¶This is if the crime is a property crime or a violent crimes. The sums of each per neighbohood will be added as columns.
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | offense_theft_other | total_offense_arson | total_offense_assault_dangerous_weapon | total_offense_burglary | total_offense_homicide | total_offense_motor_vehicle_theft | total_offense_robbery | total_offense_sex_abuse | total_offense_theft_auto | total_offense_theft_other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 0 | 0 | 55 | 78 | 8 | 349.0 | 139 | 12 | 521.0 | 772.0 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 1 | 23 | 36 | 4 | 110.0 | 82 | 3 | 283.0 | 503.0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 0 | 13 | 8 | 3 | 42.0 | 33 | 2 | 100.0 | 85.0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 0 | 1 | 132 | 50 | 27 | 160.0 | 109 | 7 | 135.0 | 213.0 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 0 | 0 | 51 | 51 | 5 | 182.0 | 131 | 6 | 516.0 | 703.0 |
5 rows × 33 columns
Confirming that there are just the two expected values.
crimes["offensegroup"].unique()
array(['violent', 'property'], dtype=object)
Adding dummy variables.
crimes = pd.concat([crimes, pd.get_dummies(crimes["offensegroup"], prefix="off_type")], axis=1)
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | total_offense_assault_dangerous_weapon | total_offense_burglary | total_offense_homicide | total_offense_motor_vehicle_theft | total_offense_robbery | total_offense_sex_abuse | total_offense_theft_auto | total_offense_theft_other | off_type_property | off_type_violent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 55 | 78 | 8 | 349.0 | 139 | 12 | 521.0 | 772.0 | 0 | 1 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 23 | 36 | 4 | 110.0 | 82 | 3 | 283.0 | 503.0 | 0 | 1 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 13 | 8 | 3 | 42.0 | 33 | 2 | 100.0 | 85.0 | 0 | 1 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 132 | 50 | 27 | 160.0 | 109 | 7 | 135.0 | 213.0 | 0 | 1 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 51 | 51 | 5 | 182.0 | 131 | 6 | 516.0 | 703.0 | 1 | 0 |
5 rows × 35 columns
Summing the values for each type per neighborhood.
crimes["total_property"] = crimes.groupby("NEIGHBORHOOD_NAME")["off_type_property"].transform("sum")
crimes["total_violent"] = crimes.groupby("NEIGHBORHOOD_NAME")["off_type_violent"].transform("sum")
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | total_offense_homicide | total_offense_motor_vehicle_theft | total_offense_robbery | total_offense_sex_abuse | total_offense_theft_auto | total_offense_theft_other | off_type_property | off_type_violent | total_property | total_violent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 8 | 349.0 | 139 | 12 | 521.0 | 772.0 | 0 | 1 | 1720.0 | 214.0 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 4 | 110.0 | 82 | 3 | 283.0 | 503.0 | 0 | 1 | 933.0 | 112.0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 3 | 42.0 | 33 | 2 | 100.0 | 85.0 | 0 | 1 | 235.0 | 51.0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 27 | 160.0 | 109 | 7 | 135.0 | 213.0 | 0 | 1 | 559.0 | 275.0 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 5 | 182.0 | 131 | 6 | 516.0 | 703.0 | 1 | 0 | 1452.0 | 193.0 |
5 rows × 37 columns
total_type = crimes["total_property"]
print("r = ", stats.pearsonr(avg_price, total_type)[0])
print("rho = ", stats.spearmanr(avg_price, total_type)[0])
r = 0.3511584011434341 rho = 0.3852270390342173
There is a medium positive correlation between property crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_type, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Property crime offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There does appear to be a positive correlation.
total_type = crimes["total_violent"]
print("r = ", stats.pearsonr(avg_price, total_type)[0])
print("rho = ", stats.spearmanr(avg_price, total_type)[0])
r = -0.22601437664047108 rho = -0.12378245872988368
There is a slight/medium negative correlation between violent crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_type, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Violent crime offenses per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
There does appear to be a negative correlation, though it appears less obvious than the property crime appears positive.
Violent crimes appear to be negatively correlated with price and property crime positively correlated. This makes some intuitive sense, as areas with violent crime would not be popular with tourists, but areas of higher population density that attract property crime may still have higher prices.
method vs price ¶The method of the crime will be compared to the price.
crimes["METHOD"].unique()
array(['others', 'gun', 'knife'], dtype=object)
The values for this variable are as expected, so we'll make dummy variables.
crimes = pd.concat([crimes, pd.get_dummies(crimes["METHOD"], prefix="method")], axis=1)
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | total_offense_sex_abuse | total_offense_theft_auto | total_offense_theft_other | off_type_property | off_type_violent | total_property | total_violent | method_gun | method_knife | method_others | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 12 | 521.0 | 772.0 | 0 | 1 | 1720.0 | 214.0 | 0 | 0 | 1 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 3 | 283.0 | 503.0 | 0 | 1 | 933.0 | 112.0 | 0 | 0 | 1 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 2 | 100.0 | 85.0 | 0 | 1 | 235.0 | 51.0 | 1 | 0 | 0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 7 | 135.0 | 213.0 | 0 | 1 | 559.0 | 275.0 | 0 | 0 | 1 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 6 | 516.0 | 703.0 | 1 | 0 | 1452.0 | 193.0 | 0 | 0 | 1 |
5 rows × 40 columns
Summing the values for each method per neighborhood.
crimes["total_gun"] = crimes.groupby("NEIGHBORHOOD_NAME")["method_gun"].transform("sum")
crimes["total_knife"] = crimes.groupby("NEIGHBORHOOD_NAME")["method_knife"].transform("sum")
crimes["total_others"] = crimes.groupby("NEIGHBORHOOD_NAME")["method_others"].transform("sum")
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | off_type_property | off_type_violent | total_property | total_violent | method_gun | method_knife | method_others | total_gun | total_knife | total_others | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 0 | 1 | 1720.0 | 214.0 | 0 | 0 | 1 | 116 | 28 | 1790.0 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 1 | 933.0 | 112.0 | 0 | 0 | 1 | 57 | 9 | 979.0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 1 | 235.0 | 51.0 | 1 | 0 | 0 | 42 | 2 | 242.0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 0 | 1 | 559.0 | 275.0 | 0 | 0 | 1 | 190 | 37 | 607.0 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 1 | 0 | 1452.0 | 193.0 | 0 | 0 | 1 | 94 | 27 | 1524.0 |
5 rows × 43 columns
total_method = crimes["total_gun"]
print("r = ", stats.pearsonr(avg_price, total_method)[0])
print("rho = ", stats.spearmanr(avg_price, total_method)[0])
r = -0.41188814871022433 rho = -0.2945376995157476
There is a medium negative correlation between gun crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_method, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Gun crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The chart does appear to be a negative correlation.
total_method = crimes["total_knife"]
print("r = ", stats.pearsonr(avg_price, total_method)[0])
print("rho = ", stats.spearmanr(avg_price, total_method)[0])
r = -0.11715637516060468 rho = -0.0033041250964812737
There is a slight negative correlation between knife crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_method, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Knife crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The chart appears to have a slight negative correlation.
total_method = crimes["total_others"]
print("r = ", stats.pearsonr(avg_price, total_method)[0])
print("rho = ", stats.spearmanr(avg_price, total_method)[0])
r = 0.34433289936254546 rho = 0.3842336057132542
There is a slight positive correlation between other crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_method, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Other crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The chart appears to have a positive correlation.
There seems to be a negative correlation between gun and knife crimes and prices and a positive correlation between other method crimes and price. This seems to fit with the previous comparison, that violent crime is negatively correlated with price and property crime positively correlated. There is probably a breakdown to very specific types or ratios of crimes in a neighborhood that are more closely related to price. There is also perhaps more crime/population data that would explain these numbers that are not available in the data we have.
shift vs price ¶The shift variable is what time of day that the crime occurs.
crimes["SHIFT"].unique()
array(['midnight', 'evening', 'day'], dtype=object)
As the values are expected, we will create a dummy variable.
crimes = pd.concat([crimes, pd.get_dummies(crimes["SHIFT"], prefix="shift")], axis=1)
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | total_violent | method_gun | method_knife | method_others | total_gun | total_knife | total_others | shift_day | shift_evening | shift_midnight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 214.0 | 0 | 0 | 1 | 116 | 28 | 1790.0 | 0 | 0 | 1 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 112.0 | 0 | 0 | 1 | 57 | 9 | 979.0 | 0 | 1 | 0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 51.0 | 1 | 0 | 0 | 42 | 2 | 242.0 | 0 | 1 | 0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 275.0 | 0 | 0 | 1 | 190 | 37 | 607.0 | 0 | 0 | 1 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 193.0 | 0 | 0 | 1 | 94 | 27 | 1524.0 | 0 | 1 | 0 |
5 rows × 46 columns
Summing the values for each method per neighborhood.
crimes["total_day"] = crimes.groupby("NEIGHBORHOOD_NAME")["shift_day"].transform("sum")
crimes["total_evening"] = crimes.groupby("NEIGHBORHOOD_NAME")["shift_evening"].transform("sum")
crimes["total_midnight"] = crimes.groupby("NEIGHBORHOOD_NAME")["shift_midnight"].transform("sum")
crimes.head()
| column00 | NEIGHBORHOOD_CLUSTER | offensegroup | LONGITUDE | END_DATE | offense-text | SHIFT | offensekey | START_DATE | OFFENSE | ... | method_others | total_gun | total_knife | total_others | shift_day | shift_evening | shift_midnight | total_day | total_evening | total_midnight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cluster 25 | violent | -77.002778 | 9/7/2021, 2:18:00 PM | homicide | midnight | violent|homicide | 2021-09-07 | homicide | ... | 1 | 116 | 28 | 1790.0 | 0 | 0 | 1 | 805.0 | 763.0 | 366.0 |
| 1 | 1 | cluster 26 | violent | -76.997326 | 10/3/2021, 3:20:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 1 | 57 | 9 | 979.0 | 0 | 1 | 0 | 491.0 | 390.0 | 164.0 |
| 2 | 2 | cluster 35 | violent | -76.964260 | 10/3/2021, 6:40:00 PM | robbery | evening | violent|robbery | 2021-10-03 | robbery | ... | 0 | 42 | 2 | 242.0 | 0 | 1 | 0 | 136.0 | 107.0 | 43.0 |
| 3 | 3 | cluster 39 | violent | -76.999896 | 10/11/2021, 8:52:00 PM | assault w/dangerous weapon | midnight | violent|assault w/dangerous weapon | 2021-10-11 | assault w/dangerous weapon | ... | 1 | 190 | 37 | 607.0 | 0 | 0 | 1 | 260.0 | 326.0 | 248.0 |
| 4 | 4 | cluster 2 | property | -77.025347 | 10/2/2021, 11:01:00 PM | theft f/auto | evening | property|theft f/auto | 2021-10-02 | theft f/auto | ... | 1 | 94 | 27 | 1524.0 | 0 | 1 | 0 | 689.0 | 619.0 | 337.0 |
5 rows × 49 columns
total_shift = crimes["total_day"]
print("r = ", stats.pearsonr(avg_price, total_shift)[0])
print("rho = ", stats.spearmanr(avg_price, total_shift)[0])
r = 0.3183165635613521 rho = 0.4089085328652436
There is a medium negative correlation between day crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_shift, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Day crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The chart does appear to be a positive correlation.
total_shift = crimes["total_evening"]
print("r = ", stats.pearsonr(avg_price, total_shift)[0])
print("rho = ", stats.spearmanr(avg_price, total_shift)[0])
r = 0.3088018392887768 rho = 0.337923881227923
There is a medium/slight positive correlation between evening crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_shift, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Evening crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The chart appears to have a slight positive correlation.
total_shift = crimes["total_midnight"]
print("r = ", stats.pearsonr(avg_price, total_shift)[0])
print("rho = ", stats.spearmanr(avg_price, total_shift)[0])
r = 0.20845253952901144 rho = 0.2663572018757196
There is a slight positive correlation between midnight crimes and average price.
figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.scatter(total_shift, avg_price, marker="o", color="darkslategray")
axes.set_ylabel("Average Price per neighborhood")
axes.set_xlabel("Midnight crimes per neighborhood")
axes.set_title("Scatter Plot")
plt.show()
plt.close()
The chart appears to have a slight positive correlation.
There appears to be a slight positive correlation between each time of day and price. The highest positive correlation is day, then evening, then midnight, which if we assume that more violent crimes are committed at night, makes sense. Overall these numbers are similar to those of the overall crime totals.
The crime data does appear to be correlated with price, but without more information on how population distribution affects both price and crime, it may not be useful in helping to predict prices.
When comparing crime and price it was first necessary to group them both by neighborhood. When thinking about the data, if we hypothesize that crimes in an area affect prices of listings, we are talking about crimes that have happened within a geographic distance as well as a temporal distance prior to the listing date. This was not the end goal of the project and would have involved a lot more data to explore potential relationship factors, so the times of crimes relative to listings were ignored. Instead we took counts or averages of crimes for each neighborhood, as a way to create a metric that could represent a sort of "crime potential" value. As we also grouped the listings into neighborhoods and compared this against the average price per area we are looking at general trends in each area, which, even if there is a relationship between crimes and prices, may not have the nuance to show it.
The specific variables looked at include total crimes (the count of crimes occuring in each neighborhood), the counts of each type of offense, the count of crimes in each offense group (property or violent), the count of crimes of each method (gun, knife, other), and the count of crimes occuring during which shift, or time period (day, evening, midnight).
None of the correlation statitics suggested a strong correlation:
Total crimes showed a lower-medium positive correlation with price.
Homicide had medium negative, auto theft lower-medium positive correlation, and other thefts offences medium positive correlations respectively.
A lower-medium positive correlation with propety crimes, and a lower-medium negative correlation with violent crimes.
Gun crime had a medium negative correlation and the method other types had a lower-medium positive correlation.
Day shift had the highest medium positive correlation, evening shift a lower medium positive, and midnight a little lower than that.
Note: we hypothesize that the positive correlations between crime and price are due in part to population density - more crime is expected in areas with more people, and more people in an area can also increase prices. To confirm this, we would need additional data. The patterns in correlation statistics suggest that in general, crime is associated with higher prices, unless the crime is violent (the offense types with negative correlation, gun crime with negative correlation, the violent crime group with negative correlation). This is why we used violent crime when developing the model - it seemed less likely to have a confounding variable that would be represented by one of the location variables.
The variable price that represents price per night charged for that listing in the database, is the target variable we interested to predict by other related covariates. Since Airbnb is an online community that provides short-term homestays and apartment rental at an amzing price, therefore, the price in database is expected to concentrate within the smaller charges and looks like positive skewed distribution.
full_latest_listings = con.execute("SELECT * from full_latest_listings").df()
full_latest_listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10560 entries, 0 to 10559 Data columns (total 87 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10560 non-null int64 1 listing_url 10560 non-null object 2 scrape_id 10560 non-null int64 3 last_scraped 10560 non-null datetime64[ns] 4 name 10559 non-null object 5 description 10416 non-null object 6 neighborhood_overview 6559 non-null object 7 picture_url 10560 non-null object 8 host_id 10560 non-null int32 9 host_url 10560 non-null object 10 host_name 10398 non-null object 11 host_since 10398 non-null datetime64[ns] 12 host_location 9548 non-null object 13 host_about 6013 non-null object 14 host_response_time 10398 non-null object 15 host_response_rate 7452 non-null float64 16 host_acceptance_rate 8120 non-null float64 17 host_is_superhost 10554 non-null object 18 host_thumbnail_url 10398 non-null object 19 host_picture_url 10398 non-null object 20 host_neighbourhood 9583 non-null object 21 host_listings_count 10398 non-null float64 22 host_total_listings_count 10398 non-null float64 23 host_verifications 10560 non-null object 24 host_has_profile_pic 10398 non-null object 25 host_identity_verified 10398 non-null object 26 neighbourhood 6559 non-null object 27 neighbourhood_cleansed 10560 non-null object 28 neighbourhood_group_cleansed 0 non-null float64 29 latitude 10560 non-null float64 30 longitude 10560 non-null float64 31 property_type 10560 non-null object 32 room_type 10560 non-null object 33 accommodates 10560 non-null int32 34 bathrooms 0 non-null float64 35 bathrooms_text 10546 non-null object 36 bedrooms 9600 non-null float64 37 beds 10291 non-null float64 38 amenities 10560 non-null object 39 price 10560 non-null float64 40 minimum_nights 10560 non-null int32 41 maximum_nights 10560 non-null int32 42 minimum_minimum_nights 10557 non-null float64 43 maximum_minimum_nights 10557 non-null float64 44 minimum_maximum_nights 10557 non-null float64 45 maximum_maximum_nights 10557 non-null float64 46 minimum_nights_avg_ntm 10557 non-null float64 47 maximum_nights_avg_ntm 10557 non-null float64 48 calendar_updated 0 non-null float64 49 has_availability 10560 non-null bool 50 availability_30 10560 non-null int32 51 availability_60 10560 non-null int32 52 availability_90 10560 non-null int32 53 availability_365 10560 non-null int32 54 calendar_last_scraped 10560 non-null datetime64[ns] 55 number_of_reviews 10560 non-null int32 56 number_of_reviews_ltm 10560 non-null int32 57 number_of_reviews_l30d 10560 non-null int32 58 first_review 7740 non-null datetime64[ns] 59 last_review 7740 non-null datetime64[ns] 60 review_scores_rating 7740 non-null float64 61 review_scores_accuracy 7656 non-null float64 62 review_scores_cleanliness 7657 non-null float64 63 review_scores_checkin 7654 non-null float64 64 review_scores_communication 7657 non-null float64 65 review_scores_location 7655 non-null float64 66 review_scores_value 7652 non-null float64 67 license 3427 non-null object 68 instant_bookable 10560 non-null bool 69 calculated_host_listings_count 10560 non-null int32 70 calculated_host_listings_count_entire_homes 10560 non-null int32 71 calculated_host_listings_count_private_rooms 10560 non-null int32 72 calculated_host_listings_count_shared_rooms 10560 non-null int32 73 reviews_per_month 7740 non-null float64 74 NEIGHBORHOOD_NAME 10560 non-null object 75 homicides 10560 non-null int64 76 robberies 10560 non-null int64 77 assaults 10560 non-null int64 78 theft_from_auto 10560 non-null int64 79 other_thefts 10560 non-null int64 80 vehicle_theft 10560 non-null int64 81 burglaries 10560 non-null int64 82 sex_abuses 10560 non-null int64 83 arsons 10560 non-null int64 84 violent_crimes 10560 non-null int64 85 property_crimes 10560 non-null int64 86 total_crimes 10560 non-null int64 dtypes: bool(2), datetime64[ns](5), float64(26), int32(15), int64(14), object(25) memory usage: 6.3+ MB
pd.DataFrame(full_latest_listings.price.head())
| price | |
|---|---|
| 0 | 67.0 |
| 1 | 39.0 |
| 2 | 75.0 |
| 3 | 250.0 |
| 4 | 113.0 |
price=pd.DataFrame(full_latest_listings.price)
price.describe()
| price | |
|---|---|
| count | 10560.000000 |
| mean | 198.181818 |
| std | 422.463628 |
| min | 0.000000 |
| 25% | 85.000000 |
| 50% | 129.000000 |
| 75% | 206.000000 |
| max | 24999.000000 |
print("The rows of price over than $1000 is",len(price[price.price>1000]))
np.round(len(price[price.price>1000])/len(price)*100,3)# the ratio of price over than $1000
The rows of price over than $1000 is 137
1.297
figure = plt.figure(figsize=(15,10))
axes = figure.add_subplot(2, 2, 1)
axes.hist(price,color="darkslategray",bins=20,density=True)
axes.set_title("Price histogram distribution ")
axes.set_xlabel("Price($)")
axes = figure.add_subplot(2, 2, 2)
sns.boxplot(np.array(price),color="darkslategray")
axes.set_title("Price boxplot ")
axes.set_xlabel("Price($)")
data_price=price[price.price<1001]
axes = figure.add_subplot(2, 2, 3)
axes.hist(data_price,color="darkslategray",bins=20,density=True)
axes.set_title("Price histogram distribution(less than $1000) ")
axes.set_xlabel("Price($)")
axes = figure.add_subplot(2, 2, 4)
sns.boxplot(np.array(data_price),color="darkslategray")
axes.set_title("Price boxplot(less than $1000) ")
axes.set_xlabel("Price($)")
Text(0.5, 0, 'Price($)')
According to the summary statistics of price and the distribution of price, it can be found that
As per the previous EDA and CLD analysis, there are many factors are estimated to influence the booked price in Airbnb.For instance, the room type, number of beds, property type, and apartment location are all the potential factores to decrease or increase the price of apartment rental. So price is likely to be the outcome of many independent variables affects some working to increase price and some working to decrease price. As a result, price is very likely to follow an approximate normal distribution.
However, since the deviation of price is large as 482, and the histogram shows the positive skewed distribution, we will try logarithmic transformation of price to fit the distribution model alternatively.
We want to be able to predict price of a random booking. Our loss function will be MSE which means that over and under estimates are symmetric and large errors are penalized more than small errors. Because of this, we'll use the mean as our prediction:
price_mean=np.mean(data_price)
price_std=np.std(data_price)
price_mean
price 169.130673 dtype: float64
We predict that the price of a random booking will be $$$171.05.
The 95% error bounds on that estimate are:
price_std
price 139.129992 dtype: float64
price_mean-1.96*price_std,price_mean+1.96*price_std
(price -103.564112 dtype: float64, price 441.825457 dtype: float64)
Since the price cannot be negative, that is, we expect to see a range of values for price from 0 to $$$441.82 about 95% percent of the time.
As per the previous analysis, we first try Normal distribution to fit data.
figure = plt.figure(figsize=(10,6))
axes = figure.add_subplot(1, 1, 1)
n, bins, patches = axes.hist(data_price, color="DimGray", density=True, bins=20, alpha=0.75)
axes.set_xlabel( "Price($)")
axes.set_ylabel( "Density")
axes.set_title("Density Histogram of Price with Normal plot")
xs = [(b2 + b1)/2 for b1, b2 in zip(bins, bins[1:])]
ys = [stats.norm.pdf( k, loc=price_mean, scale=price_std) for k in xs]
axes.plot(xs, ys, color="darkred")
plt.show()
plt.close()
figure = plt.figure(figsize=(20, 8))
price_mn = np.min(data_price)
price_mx = np.max(data_price)
mean = np.mean(data_price)
std = np.std(data_price)
axes = figure.add_subplot(1, 2, 1)
values, base = np.histogram(data_price, bins=20, density=True)
cumulative = np.cumsum(values)
axes.plot(base[:-1], cumulative, color="steelblue")
axes.set_xlim((0, 1000))
sampled_data = [mean + r * std for r in np.random.standard_normal(10000)]
values2, base = np.histogram(sampled_data, bins=base, density=True)
cumulative2 = np.cumsum(values2)
axes.plot( base[:-1], cumulative2, color="firebrick")
axes.set_xlim((0, 1000))
axes.set_xlabel( "Empirical v. Theoretical: Normal Distribution")
axes = figure.add_subplot(1, 2, 2)
differences = cumulative2 - cumulative
axes.plot(base[:-1], differences, color='firebrick')
axes.set_xlim((0, 1000))
axes.hlines(0, 0, 14000, linestyles="dotted")
axes.set_xlabel( "Empirical v. Theoretical: Normal Distribution, Difference")
plt.show()
plt.close()
From above model analysis, we can see the Normal model is not a very perfect model for fitting the price data. The model will predict lower price below approximate $\$$350 and higher price over $$$400. The largest discrepancies between empirical and theoretical difference are around $\$$100 to $$$ 150 but they are very small.
As mentioned in story generation part, since the deviation of price is large and the histogram shows the positive skewed distribution, we will try logarithmic transformation of price to fit Normal distribution model alternatively.
log_price=np.log10(data_price+1)#there are some free charged price, to avoid the infinite value, we add 1 to each price data
log_mean=np.mean(log_price)
log_std=np.std(log_price)
figure = plt.figure(figsize=(10,6))
axes = figure.add_subplot(1, 1, 1)
n, bins, patches = axes.hist(log_price, color="DimGray", density=True, bins=20, alpha=0.75)
axes.set_xlabel( "Log Price($)")
axes.set_ylabel( "Density")
axes.set_title("Density Histogram of Log Price with Normal plot")
xs = [(b2 + b1)/2 for b1, b2 in zip(bins, bins[1:])]
ys = [stats.norm.pdf( k, loc=log_mean, scale=log_std) for k in xs]
axes.plot(xs, ys, color="darkred")
plt.show()
plt.close()
log_mean=np.mean(log_price)
log_std=np.std(log_price)
10**log_mean-1
price 132.786801 dtype: float64
We predict that the price of a random booking will be $$$132.78.
The 95% error bounds on that estimate are:
log_std
price 0.295416 dtype: float64
10**(log_mean-1.96*log_std)-1,10**(log_mean+1.96*log_std)-1
(price 34.26937 dtype: float64, price 506.491576 dtype: float64)
The logarithmic transformation price estimation is more rational compared to the estimation by null model, that is, we expect to see a range of values for price from $\$$ 34.27 to $$$506.49 about 95% percent of the time.
figure = plt.figure(figsize=(20, 8))
price_mn = np.min(log_price)
price_mx = np.max(log_price)
mean = np.mean(log_price)
std = np.std(log_price)
axes = figure.add_subplot(1, 2, 1)
values, base = np.histogram(log_price, bins=20, density=True)
cumulative = np.cumsum(values)
axes.plot(base[:-1], cumulative, color="steelblue")
axes.set_xlim((0, 3))
sampled_data = [mean + r * std for r in np.random.standard_normal(10000)]
values2, base = np.histogram(sampled_data, bins=base, density=True)
cumulative2 = np.cumsum(values2)
axes.plot( base[:-1], cumulative2, color="firebrick")
axes.set_xlim((0, 3))
axes.set_xlabel( "Empirical v. Theoretical: Normal Distribution")
axes = figure.add_subplot(1, 2, 2)
differences = cumulative2 - cumulative
axes.plot(base[:-1], differences, color='firebrick')
axes.set_xlim((0, 3))
axes.hlines(0, 0, 14000, linestyles="dotted")
axes.set_xlabel( "Empirical v. Theoretical: Normal Distribution, Difference")
plt.show()
plt.close()
The blue line (Normal distribution of Log Price) looks like a pretty good (smooth) approximation of the red line (Empirical distribution). The Normal distributional model with logarithmic transformation of price can fit the practical price data better compared to the previous model.
The largest discrepency is at $\$$ 50 (10<sup>1.7</sup>) approximately, the differences between fitted price and practical price mainly range from $$$ 50 to $$$ 200 (102.3) according to the Normal Distribution Difference linechart.
What is the probability of the Price less than $$$ 400 in Airbnb ?
stats.norm.cdf(np.log10(400), log_mean, log_std)
array([0.94631097])
There is 94.6% probability of a (random) price less than $$$400 in Airbnb from the last quarter.
Based on previous EDA and CLD analysis, we know price is likely to be the outcome of many independent variables affects some working to increase price and some working to decrease price. As per the domain knowledge, we assume price is very likely to follow an approximate normal distribution.
Two kinds of Null model(Mean model) and Distributional model are constructed by Price and logarithmic transformation of Price.
Normal model of Price is not a very perfect model for fitting the price data. The model will predict lower price below approximate $\$$ 350 and higher price over $$$ 400. The largest discrepancies between empirical and theoretical difference are around $\$$ 100 to $$$ 150.
Normal distribution of Log Price looks like a pretty good approximation of the Empirical distribution. The Normal distributional model with logarithmic transformation of price can fit the practical price data better.
Range of values for random booked price is from $\$$34.27 to $$$506.49 about 95% percent of the time.
The largest discrepency is at $\$$ 50 (10<sup>1.7</sup>) approximately, the differences between fitted price and practical price mainly range from $$$ 50 to $$$ 200 (102.3) according to the Normal Distribution Difference linechart.
We can predict that the probability of price less than $$$ 400 booked in Airbnb from the last quarter is 94.6%
full_latest_listings = con.execute("SELECT * from full_latest_listings").df()
# Code from Lab 10 Solution
def correlations(data, y, xs):
rs = []
rhos = []
for x in xs:
r = stats.pearsonr(data[y], data[x])[0]
rs.append(r)
rho = stats.spearmanr(data[y], data[x])[0]
rhos.append(rho)
return pd.DataFrame({"feature": xs, "r": rs, "rho": rhos})
import models
# Code from Lab 10 Solution
def plot_residuals(result, df, variables):
height = (len(variables) // 3) * 6
figure = plt.figure(figsize=(20,height))
plots = len( variables)
rows = (plots // 3) + 1
residuals = np.array([r[0] for r in result["residuals"]])
limits = max(np.abs(residuals.min()), residuals.max())
n = result["n"]
for i, variable in enumerate( variables):
axes = figure.add_subplot(rows, 3, i + 1)
keyed_values = sorted(zip(df[variable].values, residuals), key=lambda x: x[ 0])
ordered_residuals = [x[ 1] for x in keyed_values]
axes.plot(list(range(0, n)), ordered_residuals, '.', color="dimgray", alpha=0.75)
axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
axes.set_ylim((-limits, limits))
axes.set_ylabel("residuals")
axes.set_xlabel(variable)
plt.show()
plt.close()
return residuals
# Code from Lab 11 Solution
def chunk(xs, n):
k, m = divmod(len(xs), n)
return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]
# Code from Lab 11 Solution
def cross_validation(algorithm, formula, data, evaluate, fold_count=10, repetitions=3):
indices = list(range(len( data)))
metrics = []
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[fold]
train_indices = [idx not in fold for idx in indices]
train_data = data.iloc[train_indices]
result = algorithm(formula, data=train_data)
model = result["model"]
y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
# y = np.ravel( y) # might need for logistic regression
results = models.summarize(formula, X, y, model)
metric = evaluate(results)
metrics.append(metric)
return metrics
from collections import defaultdict
# Code from Lab 11 Solution
def data_collection():
result = dict()
result[ "train"] = defaultdict( list)
result[ "test"] = defaultdict( list)
return result
# Code from Lab 11 Solution
def learning_curves(algorithm, formula, data, evaluate, fold_count=10, repetitions=3, increment=1):
indices = list(range(len( data)))
results = data_collection()
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[ fold]
train_indices = [idx for idx in indices if idx not in fold]
train_data = data.iloc[train_indices]
for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
# the indices are already shuffled so we only need to take ever increasing chunks
train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
# we calculate the model
result = algorithm(formula, data=train_data_chunk)
model = result["model"]
# we calculate the results for the training data subset
y, X = patsy.dmatrices( formula, train_data_chunk, return_type="matrix")
result = models.summarize(formula, X, y, model)
metric = evaluate(result)
results["train"][i].append( metric)
# we calculate the results for the test data.
y, X = patsy.dmatrices( formula, test_data, return_type="matrix")
result = models.summarize(formula, X, y, model)
metric = evaluate(result)
results["test"][i].append( metric)
#
#
# process results
# Rely on the CLT...
statistics = {}
for k, v in results["train"].items():
statistics[ k] = (np.mean(v), np.std(v))
results["train"] = statistics
statistics = {}
for k, v in results["test"].items():
statistics[ k] = (np.mean(v), np.std(v))
results["test"] = statistics
return results
#
# Code from Lab 11 Solution
def results_to_curves( curve, results):
all_statistics = results[ curve]
keys = list( all_statistics.keys())
keys.sort()
mean = []
upper = []
lower = []
for k in keys:
m, s = all_statistics[ k]
mean.append( m)
upper.append( m + 2 * s)
lower.append( m - 2 * s)
return keys, lower, mean, upper
# Code from Lab 11 Solution
def plot_learning_curves( results, metric, zoom=False):
figure = plt.figure(figsize=(10,6))
axes = figure.add_subplot(1, 1, 1)
xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
_, test_lower, test_mean, test_upper = results_to_curves( "test", results)
axes.plot( xs, train_mean, color="steelblue")
axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
axes.plot( xs, test_mean, color="firebrick")
axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
axes.legend()
axes.set_xlabel( "training set (%)")
axes.set_ylabel( metric)
axes.set_title("Learning Curves")
if zoom:
y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
axes.set_ylim((y_lower, y_upper))
plt.show()
plt.close()
#
Our team used
To come up with the following list of variables for our final regression model:
host_listings_countlatitudelongitude"room_typeaccommodatesbedroomsbedsminimum_nightsmaximum_nightsnumber_of_reviews_ltmnumber_of_reviewsreview_scores_ratingreview_scores_accuracyreview_scores_cleanlinessreview_scores_checkinreview_scores_communicationreview_scores_locationreview_scores_valuereviews_per_monthviolent_crimeshost_identity_verifiedLet's filter our latest listings dataframe to include only rows that have non-null/non-NaN values for these variables.
df = full_latest_listings[full_latest_listings[["host_listings_count",
"latitude",
"longitude",
"room_type",
"accommodates",
"bedrooms",
"beds",
"minimum_nights",
"maximum_nights",
"number_of_reviews_ltm",
"number_of_reviews",
"review_scores_rating",
"review_scores_accuracy",
"review_scores_cleanliness",
"review_scores_checkin",
"review_scores_communication",
"review_scores_location",
"review_scores_value",
"reviews_per_month",
"violent_crimes",
"host_identity_verified"]].notnull().all(1)]
df = df[["price",
"host_listings_count",
"latitude",
"longitude",
"room_type",
"accommodates",
"bedrooms",
"beds",
"minimum_nights",
"maximum_nights",
"number_of_reviews_ltm",
"number_of_reviews",
"review_scores_rating",
"review_scores_accuracy",
"review_scores_cleanliness",
"review_scores_checkin",
"review_scores_communication",
"review_scores_location",
"review_scores_value",
"reviews_per_month",
"violent_crimes",
"host_identity_verified"]].copy()
We also know from our EDA and queries that price has a handful of extreme outliers. The above filtering might have already eliminated those outliers. Let's remove all listings with a nightly price greater than $1000 and see the percentage of listings that are removed in the process.
We chose 1000 because we found that it was reasonably far beyond the 25th, 50th, and 75th percentile to categorize it as an outlier while not being too large so as to hurt the performance of our model for the wide majority of prices which were between 50 and 250. We expect that if we were to field this regression model as a product, we could reasonably expect most users wanting to predict the price of their AirBnB to be satisfied with a price cap at 1000.
df = df[df.price <= 1000]
This removed 36 / 6751 = 0.5% of entries which is a tolerable amount. Our EDA of price also supports this decision because the mean price before filtering was: 201 USD and the 25th, 50th, and the 75th percentile were 85, 127 and 201, respectively.
Next, we need to convert all cateogorical variables into one-hot-encodings in order to be useable in the regression.
# Encode categorical variables into one-hot-encodings
df = pd.concat([df, pd.get_dummies(df["room_type"], prefix='room_type')], axis=1)
df = pd.concat([df, pd.get_dummies(df["host_identity_verified"], prefix='host_identity_verified')], axis=1)
The new column names created here might have resulted in spaces in the column names so lets fix that:
df.columns = df.columns.str.replace(', ','_')
df.columns = df.columns.str.replace(' ','_')
df.columns = df.columns.str.replace('\._','_')
df.columns = df.columns.str.replace('/|-','')
We are going to be modeling price so it is important to understand the distribution of that variable. We know from our EDA and the creation of the null model that price has a skewed distribution. It would be beneficial to transform price into log-price so that the distribution approached a symmetric, Normal distribution.
df["log_price"] = df["price"].apply(np.log)
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(df.log_price)
axes.set_title("Log(price) Histogram")
plt.show()
plt.close()
This distribution looks more symmetric. We will be predicting log_price with our model.
We no longer need the "object" type variables as they have been replaced by numerical one-hot encodings:
df = df.select_dtypes(exclude=['object'])
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 6715 entries, 0 to 10549 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 6715 non-null float64 1 host_listings_count 6715 non-null float64 2 latitude 6715 non-null float64 3 longitude 6715 non-null float64 4 accommodates 6715 non-null int32 5 bedrooms 6715 non-null float64 6 beds 6715 non-null float64 7 minimum_nights 6715 non-null int32 8 maximum_nights 6715 non-null int32 9 number_of_reviews_ltm 6715 non-null int32 10 number_of_reviews 6715 non-null int32 11 review_scores_rating 6715 non-null float64 12 review_scores_accuracy 6715 non-null float64 13 review_scores_cleanliness 6715 non-null float64 14 review_scores_checkin 6715 non-null float64 15 review_scores_communication 6715 non-null float64 16 review_scores_location 6715 non-null float64 17 review_scores_value 6715 non-null float64 18 reviews_per_month 6715 non-null float64 19 violent_crimes 6715 non-null int64 20 room_type_Entire_homeapt 6715 non-null uint8 21 room_type_Hotel_room 6715 non-null uint8 22 room_type_Private_room 6715 non-null uint8 23 room_type_Shared_room 6715 non-null uint8 24 host_identity_verified_False 6715 non-null uint8 25 host_identity_verified_True 6715 non-null uint8 26 log_price 6715 non-null float64 dtypes: float64(15), int32(5), int64(1), uint8(6) memory usage: 1.0 MB
When creating the actual model, we must remove one one-hot-encoded categorical variables so that the y-intercept of the model is interpretable:
model_variables = ['host_listings_count',
'latitude',
'longitude',
'accommodates',
'bedrooms',
'beds',
'minimum_nights',
'maximum_nights',
'number_of_reviews_ltm',
'number_of_reviews',
'review_scores_rating',
'review_scores_accuracy',
'review_scores_cleanliness',
'review_scores_checkin',
'review_scores_communication',
'review_scores_location',
'review_scores_value',
'reviews_per_month',
'violent_crimes',
'room_type_Entire_homeapt',
'room_type_Hotel_room',
'room_type_Private_room',
# 'room_type_Shared_room',
'host_identity_verified_False',
# 'host_identity_verified_True',
]
Now let's look at the correlations coefficients for each variable in or model with log_price.
correlations(df, "log_price", df.columns.values.tolist())
| feature | r | rho | |
|---|---|---|---|
| 0 | price | 0.899914 | 1.000000 |
| 1 | host_listings_count | 0.082657 | 0.028360 |
| 2 | latitude | -0.091517 | -0.133721 |
| 3 | longitude | -0.149263 | -0.138889 |
| 4 | accommodates | 0.593653 | 0.622281 |
| 5 | bedrooms | 0.596898 | 0.599590 |
| 6 | beds | 0.464350 | 0.535848 |
| 7 | minimum_nights | 0.004228 | 0.014154 |
| 8 | maximum_nights | 0.019535 | 0.083456 |
| 9 | number_of_reviews_ltm | 0.086171 | 0.148961 |
| 10 | number_of_reviews | -0.007513 | 0.004874 |
| 11 | review_scores_rating | 0.063512 | 0.080738 |
| 12 | review_scores_accuracy | 0.035579 | 0.055293 |
| 13 | review_scores_cleanliness | 0.103900 | 0.113144 |
| 14 | review_scores_checkin | 0.001992 | 0.029995 |
| 15 | review_scores_communication | -0.007127 | 0.019137 |
| 16 | review_scores_location | 0.138357 | 0.203495 |
| 17 | review_scores_value | -0.006496 | -0.024706 |
| 18 | reviews_per_month | 0.059912 | 0.094734 |
| 19 | violent_crimes | -0.074163 | -0.061904 |
| 20 | room_type_Entire_homeapt | 0.506162 | 0.523631 |
| 21 | room_type_Hotel_room | -0.090992 | -0.063876 |
| 22 | room_type_Private_room | -0.456913 | -0.482954 |
| 23 | room_type_Shared_room | -0.181160 | -0.163918 |
| 24 | host_identity_verified_False | -0.013812 | -0.021373 |
| 25 | host_identity_verified_True | 0.013812 | 0.021373 |
| 26 | log_price | 1.000000 | 1.000000 |
And here is the correlations between all the model variabels and regular price.
correlations(df, "price", df.columns.values.tolist())
| feature | r | rho | |
|---|---|---|---|
| 0 | price | 1.000000 | 1.000000 |
| 1 | host_listings_count | 0.056107 | 0.028360 |
| 2 | latitude | -0.082851 | -0.133721 |
| 3 | longitude | -0.117552 | -0.138889 |
| 4 | accommodates | 0.573274 | 0.622281 |
| 5 | bedrooms | 0.615569 | 0.599590 |
| 6 | beds | 0.478420 | 0.535848 |
| 7 | minimum_nights | 0.020790 | 0.014154 |
| 8 | maximum_nights | 0.018611 | 0.083456 |
| 9 | number_of_reviews_ltm | 0.020588 | 0.148961 |
| 10 | number_of_reviews | -0.053327 | 0.004874 |
| 11 | review_scores_rating | 0.040370 | 0.080738 |
| 12 | review_scores_accuracy | 0.008794 | 0.055293 |
| 13 | review_scores_cleanliness | 0.059820 | 0.113144 |
| 14 | review_scores_checkin | -0.011653 | 0.029995 |
| 15 | review_scores_communication | -0.018098 | 0.019137 |
| 16 | review_scores_location | 0.091118 | 0.203495 |
| 17 | review_scores_value | -0.027861 | -0.024706 |
| 18 | reviews_per_month | -0.003883 | 0.094734 |
| 19 | violent_crimes | -0.073404 | -0.061904 |
| 20 | room_type_Entire_homeapt | 0.327073 | 0.523631 |
| 21 | room_type_Hotel_room | -0.036810 | -0.063876 |
| 22 | room_type_Private_room | -0.302324 | -0.482954 |
| 23 | room_type_Shared_room | -0.101439 | -0.163918 |
| 24 | host_identity_verified_False | -0.008370 | -0.021373 |
| 25 | host_identity_verified_True | 0.008370 | 0.021373 |
| 26 | log_price | 0.899914 | 1.000000 |
Note that rho or the Spearman coefficient does not change between log_price and price whereas r or the Pearson's correlation coefficient does change. This is because Pearson's measures the linear dependence between the variable which changes when we apply a transformation like log to one of the variables.
model_string = "log_price ~ " + ' + '.join(model_variables)
result = models.bootstrap_linear_regression(model_string, data=df)
models.describe_bootstrap_lr(result)
Model: log_price ~ host_listings_count + latitude + longitude + accommodates + bedrooms + beds + minimum_nights + maximum_nights + number_of_reviews_ltm + number_of_reviews + review_scores_rating + review_scores_accuracy + review_scores_cleanliness + review_scores_checkin + review_scores_communication + review_scores_location + review_scores_value + reviews_per_month + violent_crimes + room_type_Entire_homeapt + room_type_Hotel_room + room_type_Private_room + host_identity_verified_False
| 95% BCI | ||||
| Coefficients | Mean | Lo | Hi | |
| $\beta_{0}$ | -106.70 | -132.39 | -75.19 | |
| host_listings_count | $\beta_{1}$ | 0.00 | 0.00 | 0.00 |
| latitude | $\beta_{2}$ | -3.58 | -4.02 | -3.18 |
| longitude | $\beta_{3}$ | -3.23 | -3.59 | -2.84 |
| accommodates | $\beta_{4}$ | 0.07 | 0.06 | 0.08 |
| bedrooms | $\beta_{5}$ | 0.27 | 0.24 | 0.29 |
| beds | $\beta_{6}$ | -0.03 | -0.05 | -0.02 |
| minimum_nights | $\beta_{7}$ | 0.00 | -0.00 | 0.00 |
| maximum_nights | $\beta_{8}$ | 0.00 | -0.00 | 0.00 |
| number_of_reviews_ltm | $\beta_{9}$ | 0.00 | 0.00 | 0.00 |
| number_of_reviews | $\beta_{10}$ | -0.00 | -0.00 | -0.00 |
| review_scores_rating | $\beta_{11}$ | 0.17 | 0.12 | 0.22 |
| review_scores_accuracy | $\beta_{12}$ | -0.05 | -0.13 | 0.01 |
| review_scores_cleanliness | $\beta_{13}$ | 0.18 | 0.13 | 0.22 |
| review_scores_checkin | $\beta_{14}$ | -0.13 | -0.20 | -0.06 |
| review_scores_communication | $\beta_{15}$ | -0.08 | -0.14 | -0.01 |
| review_scores_location | $\beta_{16}$ | 0.21 | 0.17 | 0.26 |
| review_scores_value | $\beta_{17}$ | -0.21 | -0.26 | -0.14 |
| reviews_per_month | $\beta_{18}$ | 0.00 | -0.01 | 0.01 |
| violent_crimes | $\beta_{19}$ | -0.00 | -0.00 | -0.00 |
| room_type_Entire_homeapt | $\beta_{20}$ | 0.83 | 0.74 | 0.92 |
| room_type_Hotel_room | $\beta_{21}$ | -0.21 | -0.59 | 0.14 |
| room_type_Private_room | $\beta_{22}$ | 0.47 | 0.38 | 0.55 |
| host_identity_verified_False | $\beta_{23}$ | 0.01 | -0.02 | 0.04 |
| Metrics | Mean | Lo | Hi | |
| $\sigma$ | 0.43 | 0.42 | 0.44 | |
| $R^2$ | 0.56 | 0.55 | 0.58 |
We have an $R^2$ of 0.56 which is fair performance for a real-world model. The $\sigma$ value is 0.43 in log space. That means that we can get a 95% CI on the mean price of:
(np.exp(np.log(171.046379) - (1.96 * 0.43)), np.exp(np.log(171.046379) + (1.96 * 0.43)))
(73.63605194038358, 397.31711570710326)
Where the mean of 171 comes from the mean price for latest listings below 1000.
Compared to the null model, which had a standard deviation (not in log space) of 144.426415, the interval for the mean would be:
(171.046379 - 1.96*144.426415, 171.046379 + 1.96*144.426415)
(-112.0293944, 454.1221524)
Comparing these two confidence intervals, we can say that our model yields tighter, more sensical confidence intervals than the null model.
The values of the coefficients hold valuable information as well:
log_price.Let's now look at the residual plots for each of the variables in our model. We hope to see a symmetric distribution centered at 0 with a denser collection of points closer to the 0 line than further away. We also want to see a uniform distribution across the x-axis a.k.a. no heteroskedacity in the residuals.
residuals = plot_residuals(result, df, model_variables)
Each of these plots looks in line with the desired distribution of residual points.
We wanted to understand how our model would perform more generally on data outside of the training set. To do this, we used cross validation. The below results are from 3 rounds of 10-fold cross validation:
print(r"95% CI for sigma:", (0.41461169, 0.47870567))
95% CI for sigma: (0.41461169, 0.47870567)
print(r"95% CI for R^2:", (0.46657763, 0.62646393))
95% CI for R^2: (0.46657763, 0.62646393)
We can also calculate the mean $\sigma$ and $R^2$ and calculate the 95% credible intervals for the mean values. This is useful for understanding the average performance of our model if we were to field it.
Let's start by seeing what the mean values are:
print("mean sigma: ", 0.4422992878849424)
print("mean R^2: ", 0.5607481871543466)
mean sigma: 0.4422992878849424 mean R^2: 0.5607481871543466
We also used the Bootstrap technique to get confidence intervals on our mean $\sigma$ and $R^2$.
print(r"95% CI for *mean* sigma:", (0.43830355, 0.44850006))
95% CI for *mean* sigma: (0.43830355, 0.44850006)
print(r"95% CI for *mean* R^2:", (0.54619598, 0.57009005))
95% CI for *mean* R^2: (0.54619598, 0.57009005)
We also wanted to understand if more data would help our model. That is, would we see a decreased value of $\sigma$ if we had used more data in our regression. To do this, we used learning curves and below are the results:
import random
import patsy
lc_result_sigma = learning_curves(models.linear_regression, model_string, df, lambda r: r["sigma"])
plot_learning_curves(lc_result_sigma, r"$\sigma$")
The deviation has mostly converged with some slightly sporadic changes in the value of $\sigma$ across the training set %.
We made 3 predictions with our regression model. The three test cases are tabulated below:
from tabulate import tabulate
table = [['Variable', 'Georgetown Mansion', 'Northeast Room', 'Average Listing'],
['host_listings_count',1,5,1.5],
['latitude',38.905,38.940,38.92],
['longitude',-77.075,-76.975,-77],
['accommodates',12,2,6],
['bedrooms',6,1,3],
['beds',8,1,3],
['minimum_nights',3,1,4],
['maximum_nights',365,1125,1125],
['number_of_reviews_ltm',8,12,10],
['number_of_reviews',30,48,39],
['review_scores_rating',4.89,4.5,4.68],
['review_scores_accuracy',4.75,4.8,4.79],
['review_scores_cleanliness',4.71,4.75,4.73],
['review_scores_checkin',4.82,4.88,4.85],
['review_scores_communication',4.81,4.85,4.83],
['review_scores_location',5,4.75,4.78],
['review_scores_value',5,4.6,4.7],
['reviews_per_month',1.5,1.7,1.6],
['violent_crimes',4,200,50],
['room_type_Entire_homeapt',1,0,1],
['room_type_Hotel_room',0,0,0],
['room_type_Private_room',0,0,0],
['room_type_Shared_room',0,1,0],
['host_identity_verified_False',0,1,0],
['host_identity_verified_True',1,0,1]
]
print(tabulate(table,headers="firstrow"))
Variable Georgetown Mansion Northeast Room Average Listing ---------------------------- -------------------- ---------------- ----------------- host_listings_count 1 5 1.5 latitude 38.905 38.94 38.92 longitude -77.075 -76.975 -77 accommodates 12 2 6 bedrooms 6 1 3 beds 8 1 3 minimum_nights 3 1 4 maximum_nights 365 1125 1125 number_of_reviews_ltm 8 12 10 number_of_reviews 30 48 39 review_scores_rating 4.89 4.5 4.68 review_scores_accuracy 4.75 4.8 4.79 review_scores_cleanliness 4.71 4.75 4.73 review_scores_checkin 4.82 4.88 4.85 review_scores_communication 4.81 4.85 4.83 review_scores_location 5 4.75 4.78 review_scores_value 5 4.6 4.7 reviews_per_month 1.5 1.7 1.6 violent_crimes 4 200 50 room_type_Entire_homeapt 1 0 1 room_type_Hotel_room 0 0 0 room_type_Private_room 0 0 0 room_type_Shared_room 0 1 0 host_identity_verified_False 0 1 0 host_identity_verified_True 1 0 1
airbnbs = [['intercept',1,1,1],
['host_listings_count',1,5,1.5],
['latitude',38.905,38.940,38.92],
['longitude',-77.075,-76.975,-77],
['accommodates',12,2,6],
['bedrooms',6,1,3],
['beds',8,1,3],
['minimum_nights',3,1,4],
['maximum_nights',365,1125,1125],
['number_of_reviews_ltm',8,12,10],
['number_of_reviews',30,48,39],
['review_scores_rating',4.89,4.5,4.68],
['review_scores_accuracy',4.75,4.8,4.79],
['review_scores_cleanliness',4.71,4.75,4.73],
['review_scores_checkin',4.82,4.88,4.85],
['review_scores_communication',4.81,4.85,4.83],
['review_scores_location',5,4.75,4.78],
['review_scores_value',5,4.6,4.7],
['reviews_per_month',1.5,1.7,1.6],
['violent_crimes',4,200,50],
['room_type_Entire_homeapt',1,0,1],
['room_type_Hotel_room',0,0,0],
['room_type_Private_room',0,0,0],
# ['room_type_Shared_room',0,1,0],
['host_identity_verified_False',0,1,0],
# ['host_identity_verified_True',1,0,1]
]
georgetown_mansion = [var[1] for var in airbnbs]
northeast_room = [var[2] for var in airbnbs]
average_listing = [var[3] for var in airbnbs]
This example was created using the values for all variables that we know contribute positively to price. Namely, the location being in the NW area of DC near Georgetown and having a high value for accomodation, beds, and rooms. It is also listed as a home/apt room type and has favorable reviews.
prediction = result["model"].predict([georgetown_mansion])
np.exp(prediction)
array([[869.42274248]])
\$869.42 per night. With an error of 0.43 in log space, this leads to error bounds of:
np.exp(prediction[0][0] - 1.96 * (result["sigma"])), np.exp(prediction[0][0] + 1.96 * (result["sigma"]))
(370.7315553438238, 2038.9305799255333)
Our 95% confidence interval for a listings with these characteristics is (370.73, 2038.93).
This example was created using values of all variables that we know contribute negatively to price. This listing is located in Northeast DC, accomodates only 2 people in 1 room with 1 bed, has fair or slightly below average ratings, a high violent crime rate, and is listed as a shared room as opposed to an entire home/suite.
prediction = result["model"].predict([northeast_room])
np.exp(prediction)
array([[36.663587]])
\$36.66 per night. With an error of 0.43 in log space, this leads to error bounds of:
np.exp(prediction[0][0] - 1.96 * (result["sigma"])), np.exp(prediction[0][0] + 1.96 * (result["sigma"]))
(15.633762460182966, 85.98177279459344)
Our 95% confidence interval for a listings with these characteristics is (15.63, 85.98).
This example was created using average values for all numerical variables in the model and our best attempt at averaging our the cateogorical variables. This listing has 3 bedroom and 3 beds and hosts 6 people. It has average reviews in every category.
prediction = result["model"].predict([average_listing])
np.exp(prediction)
array([[217.45611794]])
\$217.46 per night. With an error of 0.43 in log space this leads to error bounds of:
np.exp(prediction[0][0] - 1.96 * (result["sigma"])), np.exp(prediction[0][0] + 1.96 * (result["sigma"]))
(92.72571429652014, 509.96817427443943)
Our 95% confidence interval for a listings with these characteristics is (92.73, 509.97).
Circling back to the goal of informing an AirBNB lister, we can use the regression model data illustrate a few things. Some of these are obvious, but it is comforting to see them rooted in data, as opposed to just relying on conventional wisdom.
When choosing a property for an AirBNB rental, the increasing the following attributes can help drive decisions and set expectations.
We can also use the regression model as a prediction tool, hopefully allowing us to set prices more aligned with the market than going through the trial and error of setting a price, seeing if is booked, and adjusting. The three cases shown above may be helpful in guiding an AirBNB lister in setting a price.
# write to html
import os
os.system('jupyter nbconvert --to html report.ipynb')
[NbConvertApp] Converting notebook report.ipynb to html
[NbConvertApp] ERROR | Notebook JSON is invalid: data.cells[{data__cells_x}] must be valid exactly by one definition (0 matches found)
Failed validating <unset> in notebook['data']['cells']:
On instance:
<unset>
[NbConvertApp] Writing 9011262 bytes to report.html
0